Analysis of Childcare Costs in Relation to Family Income and Employment Data¶
Prepared by: Matthew Walker, Neysha Pagán Vargas
Date: December 15, 2024
INTRODUCTION¶
Childcare costs have become a critical concern for families across various income levels, impacting financial stability and employment decisions. This project, Analysis of Childcare Costs in Relation to Family Income and Employment Data, aims to explore the intricate relationships between childcare expenses, family income, and employment dynamics. By developing a robust data engineering pipeline, the project seeks to provide actionable insights into how childcare affordability varies across demographics, geographic regions, and employment statuses.
The primary goal of this project is to assess how childcare costs influence financial stability and employment opportunities, especially for women in the workforce. Using a combination of static datasets and API-sourced data, the project will build a scalable and efficient data processing system that ensures data accuracy and accessibility. The analysis will focus on key metrics, including childcare affordability as a percentage of family income, regional disparities, and correlations between employment rates and childcare burdens.
The project explores multiple database approaches to ensure optimal data storage, retrieval, and analysis. While MongoDB is utilized for its flexibility and scalability in handling semi-structured data, Cassandra is considered for its high availability and write performance across distributed environments, making it suitable for large-scale, geographically dispersed data. Additionally, Elasticsearch is included as an alternative for its full-text search capabilities and fast querying, enabling advanced data exploration and trend analysis.
This notebook will guide users through the development and implementation of the data pipeline, utilizing tools like PySpark for data transformation, MinIO for object storage, and various database technologies (MongoDB, Cassandra, and Elasticsearch) for data querying and analysis. Visualizations generated in Jupyter Notebooks and Kibana will provide a clear depiction of trends and correlations, enabling stakeholders to understand the challenges and opportunities in childcare affordability.
By combining data engineering best practices with detailed analysis and a diversified database approach, this project aims to uncover meaningful insights and inform policy recommendations for making childcare more accessible and equitable.
PROJECT OVERVIEW¶
This project aims to develop a data engineering pipeline to ingest, process, and analyze data on childcare costs alongside employment and family income data. The objective is to assess how childcare costs impact family financial stability, focusing on factors such as employment rates, income distribution, and affordability of childcare across different states and counties. This proposal will outline the key datasets, potential analyses, and the data engineering methodologies involved in constructing an effective data pipeline.
Dataset Descriptions¶
All data is from 2018¶
Childcare Prices as a Share of Median Family Income
- Source: Department of Labor (DOL)
- URL: Median Family Income by Age and Care Setting
- Description:
- Provides data on childcare costs for different care settings (center-based and home-based) and age groups (infant, toddler, preschool, and school-age).
- Includes county-level information on childcare prices as a share of median family income.
- Key variables:
- Care settings (e.g., center-based, home-based)
- Age groups
- Median family income
- Percentage of income spent on childcare.
Labor Force Statistics: Employment and Earnings
- Source: Bureau of Labor Statistics (BLS)
- URL: CPS Annual Averages for 2018
- Dataset Name: 8. Employed and unemployed full- and part-time workers by age, sex, race, and Hispanic or Latino ethnicity
- Description:
- Provides detailed labor force statistics, focusing on employment rates and earnings.
- Worksheets used:
- CPSAAT08: Employed and unemployed full- and part-time workers by age, sex, race, and Hispanic or Latino ethnicity.
- Key variables:
- Employment status (full-time, part-time)
- Median wages
- Demographic breakdowns (age, gender, and occupation).
Median Family Income Data
- Source: National Center for Education Statistics (NCES)
- URL: Median Income by State
- Description:
- Contains state-level averages for median family income.
- Similar to the DOL dataset but aggregated at the state level, losing granularity at the county level.
- Potential redundancy, as it mirrors data from the DOL.
Techniques and Methodologies¶
This project prioritizes the development and implementation of robust data engineering practices to handle and process large-scale datasets related to childcare costs, family income, and employment statistics. The emphasis is on building a scalable and efficient pipeline that ensures data accuracy, reliability, and readiness for analysis. While some visualizations will be included to summarize results, the primary focus is to demonstrate the end-to-end data engineering process. Below is a detailed outline of the techniques and methodologies employed:
1. Data Ingestion¶
- Sources:
- CSV and Excel files from Department of Labor (DOL), US Census Bureau (USCB), and National Center for Education Statistics (NCES).
- Worksheet (CPSAAT08) from the Bureau of Labor Statistics (BLS).
- Tools:
- MinIO Object Storage:
- Used to store raw files in an organized manner.
- Files are partitioned by source and year for efficient retrieval.
- PySpark:
- Reads and ingests datasets into Spark DataFrames for distributed processing.
2. Data Preprocessing¶
- Handling Missing and Inconsistent Data:
- Impute missing values where applicable.
- Remove or standardize invalid entries.
- Standardization:
- Ensure consistent formats for numeric fields (e.g., median income, childcare costs).
- Standardize categorical fields (e.g., state names, care settings).
- Schema Validation:
- Use schema enforcement during data ingestion to validate data types and structures.
- Reject invalid rows and log issues for auditability.
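The validate-and-reject step above can be sketched without Spark. The snippet below is a minimal, illustrative version of the pattern (the column names and types are assumptions, not the project's actual schema); in the pipeline itself this role is played by schema enforcement at read time.

```python
# Minimal sketch of the validate-and-reject pattern. Columns and types are
# illustrative only; the real pipeline enforces a Spark schema at ingestion.
EXPECTED_SCHEMA = {"state": str, "county": str, "median_income": float}

def validate_row(row, schema=EXPECTED_SCHEMA):
    """Return True if every expected field is present with the right type."""
    return all(isinstance(row.get(field), typ) for field, typ in schema.items())

def split_valid_invalid(rows):
    """Separate rows into (valid, rejected) so rejects can be logged for audit."""
    valid, rejected = [], []
    for row in rows:
        (valid if validate_row(row) else rejected).append(row)
    return valid, rejected

rows = [
    {"state": "Alabama", "county": "Autauga", "median_income": 58343.0},
    {"state": "Alaska", "county": "Nome", "median_income": "N/A"},  # wrong type -> rejected
]
valid, rejected = split_valid_invalid(rows)
```

Keeping the rejected rows (rather than silently dropping them) is what makes the pipeline auditable.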
3. Data Transformation¶
- ETL Pipeline:
- Extract:
- Load raw datasets from MinIO into PySpark DataFrames.
- Transform:
- Join datasets on common fields (e.g., state, county, year).
- Compute derived metrics such as:
- Childcare costs as a percentage of median family income.
- Employment rates correlated with childcare affordability.
- Aggregate data to state and county levels for comparison.
- Load:
- Save the cleaned and transformed data into partitioned collections in MongoDB, Cassandra, and Elasticsearch.
- Feature Engineering:
- Create additional columns for demographic aggregations and geographic comparisons.
- Calculate trends over time to support historical analysis.
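The project's core derived metric, childcare cost as a percentage of median family income, is a simple ratio. A hedged sketch in plain Python (the pipeline computes the same quantity with Spark column expressions; the numbers here are illustrative, not from the dataset):

```python
# Childcare price as a percent of median family income. Example values are
# illustrative only.
def childcare_share_of_income(annual_price, median_family_income):
    """Return childcare price as a percent of median family income."""
    if not median_family_income:
        return None  # guard against division by zero for missing incomes
    return round(100.0 * annual_price / median_family_income, 1)

share = childcare_share_of_income(9_600, 60_000)  # -> 16.0
```

The guard clause matters because some counties in the source data have missing income figures, and a raw division would abort the whole transformation.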
4. Data Storage¶
- Databases Used:
- Cassandra:
- For scalable, high-availability storage of structured data.
- Supports efficient querying for region-specific and demographic analyses.
- MongoDB:
- To handle semi-structured data, such as demographic and childcare details.
- Allows flexible querying of nested fields for detailed insights.
- Elasticsearch:
- Provides fast querying capabilities for text-based data exploration and search-driven analysis.
- Partitioning:
- Data is partitioned by region (state, county) and year to optimize storage and retrieval performance.
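The region/year partitioning scheme amounts to a predictable key layout, so that a state-and-year scan touches a single prefix. A small sketch (the path components are assumptions chosen for illustration):

```python
# Illustrative partitioned key layout: source / state / year / file.
# Component names are assumptions, not the project's exact object keys.
def partition_key(source, state, year, filename):
    return f"{source}/{state}/{year}/{filename}"

key = partition_key("dol", "NewYork", 2018, "childcare.csv")
# -> "dol/NewYork/2018/childcare.csv"
```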
5. Visualization¶
- Purpose:
- To verify data transformations and provide summaries of key metrics.
- Tools:
- Matplotlib and Plotly:
- Create visualizations such as:
- Geographic distributions of childcare costs as a percentage of income.
- Correlations between employment rates and childcare costs.
- Trend lines for historical comparisons.
- Scope:
- Minimal visualizations, primarily used to validate the results of the pipeline and illustrate key findings.
Outcome¶
The project will produce a reliable, scalable data engineering pipeline that ingests, cleans, and transforms datasets into a unified, queryable format. By focusing on data engineering, the project ensures that the data is ready for downstream applications, including policy analysis, detailed reporting, and decision-making. Visualizations serve as supplementary outputs, while the primary deliverable is the robust infrastructure built to process and manage complex datasets.
LIBRARIES¶
pip install boto3
Requirement already satisfied: boto3 in /opt/conda/lib/python3.9/site-packages (1.35.81) Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /opt/conda/lib/python3.9/site-packages (from boto3) (1.0.1) Requirement already satisfied: s3transfer<0.11.0,>=0.10.0 in /opt/conda/lib/python3.9/site-packages (from boto3) (0.10.4) Requirement already satisfied: botocore<1.36.0,>=1.35.81 in /opt/conda/lib/python3.9/site-packages (from boto3) (1.35.81) Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /opt/conda/lib/python3.9/site-packages (from botocore<1.36.0,>=1.35.81->boto3) (2.8.2) Requirement already satisfied: urllib3<1.27,>=1.25.4 in /opt/conda/lib/python3.9/site-packages (from botocore<1.36.0,>=1.35.81->boto3) (1.26.6) Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.9/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.36.0,>=1.35.81->boto3) (1.16.0) Note: you may need to restart the kernel to use updated packages.
pip install openpyxl
Requirement already satisfied: openpyxl in /opt/conda/lib/python3.9/site-packages (3.1.5) Requirement already satisfied: et-xmlfile in /opt/conda/lib/python3.9/site-packages (from openpyxl) (2.0.0) Note: you may need to restart the kernel to use updated packages.
pip install -q cassandra-driver
Note: you may need to restart the kernel to use updated packages.
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import os
import subprocess
from urllib.request import Request, urlopen, urlretrieve
from urllib.error import HTTPError
# install if needed: pip install boto3 , openpyxl
import boto3
from openpyxl import load_workbook
from pyspark.sql.functions import lit
import pandas as pd
from IPython.display import display
import re
from pyspark.sql.functions import col
import json
from cassandra.cluster import Cluster
USER DEFINED FUNCTIONS¶
# 1. Function to sanitize a single column name
def sanitize_column(name):
    """
    Sanitizes a column name by replacing special characters and standardizing the format.
    - Replaces '%' with '_percent' and '$' with '_dollar'.
    - Converts spaces and special characters into underscores.
    - Strips leading/trailing underscores and converts to lowercase.
    """
    # Replace '%' and '$' with meaningful labels
    name = name.replace('%', '_percent')
    name = name.replace('$', '_dollar')
    # Replace special characters with underscores
    name = re.sub(r'[^\w\s]', '_', name)  # Retain alphanumeric characters and spaces
    # Replace spaces with underscores and collapse multiple underscores
    name = re.sub(r'\s+', '_', name)
    name = re.sub(r'_+', '_', name)
    # Remove leading/trailing underscores and convert to lowercase
    return name.strip('_').lower()

# 2. Function to apply sanitization to all columns in a DataFrame
def sanitize_dataframe_columns(dataframe):
    """
    Sanitizes all column names in a Pandas DataFrame.
    """
    sanitized_columns = [sanitize_column(col) for col in dataframe.columns]
    dataframe.columns = sanitized_columns
    return dataframe
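A quick sanity check of the sanitizer on two column names that appear in the childcare dataset (the function is repeated here so the snippet runs standalone):

```python
import re

def sanitize_column(name):
    # Same logic as the UDF above: label '%' and '$', underscore everything else.
    name = name.replace('%', '_percent').replace('$', '_dollar')
    name = re.sub(r'[^\w\s]', '_', name)
    name = re.sub(r'\s+', '_', name)
    name = re.sub(r'_+', '_', name)
    return name.strip('_').lower()

print(sanitize_column("Infant center-based price: 2018 ($)"))
# -> infant_center_based_price_2018_dollar
print(sanitize_column("Percent Asian (%)"))
# -> percent_asian_percent
```

Note that `$` and `%` are substituted *before* the generic special-character pass; reversing the order would erase the units from the column names.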
PART I: Data Ingestion¶
The datasets for this project are retrieved directly from publicly available sources, including government and organizational websites. Each dataset is downloaded programmatically or manually (if necessary) from its respective source link. The data files are then stored locally before being uploaded to MinIO, which serves as the centralized storage solution for this project.
MINIO Storage¶
MinIO is used as the landing storage for all data sources.
Connect to MinIO from Jupyter Terminal
Create new bucket for the project
Verify new bucket
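The three terminal steps above can be expressed as `mc` commands. A sketch, assuming the same alias (`ms`), endpoint, and credentials used by the upload code later in this notebook; the commands are defined here but only executed when uncommented inside the course environment:

```python
import subprocess

# The three terminal steps as `mc` commands: connect, create bucket, verify.
# Alias, endpoint, and credentials mirror the values used later in this notebook.
MC_STEPS = [
    ["mc", "alias", "set", "ms", "http://minio:9000", "minio", "SU2orange!"],  # connect to MinIO
    ["mc", "mb", "ms/project"],                                                # create the project bucket
    ["mc", "ls", "ms"],                                                        # verify the bucket exists
]

def run_mc_steps(steps=MC_STEPS):
    """Run each mc command in order; raises CalledProcessError on failure."""
    for cmd in steps:
        subprocess.run(cmd, check=True)

# run_mc_steps()  # uncomment when running inside the Jupyter/MinIO environment
```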
Download the datasets and upload them to MinIO
# Define dataset URLs and filenames
datasets = [
    ("https://nces.ed.gov/programs/digest/d22/tables/xls/tabn102.30.xlsx", "tabn102.30.xlsx"),
    ("https://www.dol.gov/sites/dolgov/files/WB/NDCP/Childcare-Prices-by-Age-Children-Care-Setting-Data-Table-2023.xlsx", "Childcare-Prices-2023.xlsx"),
    ("https://www.bls.gov/cps/aa2018/cpsaat08.xlsx", "cpsaat08.xlsx"),
]

# Manually downloaded files
manual_files = [
    ("https://data.census.gov/table/ACSST1Y2018.S2002?t=Income%20and%20Poverty&g=010XX00US,$0400000&y=2018",
     "ACSST1Y2018.S2002-2024-11-10T090744.csv")
]

# Directory to store downloaded files
download_dir = "datasets"
os.makedirs(download_dir, exist_ok=True)

# Download datasets
print("Downloading datasets...")
for url, filename in datasets:
    local_path = os.path.join(download_dir, filename)
    if not os.path.exists(local_path):
        try:
            # Set a browser-like User-Agent so the servers accept the request
            req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
            with urlopen(req) as response, open(local_path, 'wb') as out_file:
                out_file.write(response.read())
            print(f"Downloaded: {filename}")
        except HTTPError as e:
            print(f"Failed to download {filename}: {e}. Please download it manually.")
    else:
        print(f"Already exists: {filename}")

# Manual download notification
print("\nManual download required for the following files:")
for _, filename in manual_files:
    print(f"- {filename}. Please place it in {download_dir}.")

# Configure MinIO
minio_alias = "ms"
minio_bucket = "project"

# Set MinIO alias
print("\nConfiguring MinIO...")
subprocess.run(
    ["mc", "alias", "set", minio_alias, "http://minio:9000", "minio", "SU2orange!"],
    check=True
)

# Upload datasets to MinIO
print("\nUploading datasets to MinIO bucket 'project'...")
all_files = datasets + manual_files
for _, filename in all_files:
    local_path = os.path.join(download_dir, filename)
    if os.path.exists(local_path):  # Check if the file exists
        try:
            subprocess.run(
                ["mc", "cp", local_path, f"{minio_alias}/{minio_bucket}/"],
                check=True,
            )
            print(f"Uploaded: {filename}")
        except subprocess.CalledProcessError as e:
            print(f"Failed to upload {filename}: {e}")
    else:
        print(f"File not found: {filename}. Skipping upload.")

print("\nProcess completed!")
Note¶
The following files could not be downloaded programmatically due to access restrictions (HTTP Error 403) and had to be downloaded manually:
cpsaat08.xlsx from the Bureau of Labor Statistics (BLS) - URL: BLS CPSAAT08 File
ACSST1Y2018.S2002-2024-11-10T090744.csv from the U.S. Census Bureau - URL: Census Data
To ensure proper processing:
- Visit the respective URLs and download the files manually.
- Move the files into the directory /home/jovyan/datasets.
Once the files are in the correct directory, they will be included in subsequent data uploads to the MinIO bucket.
- Verify the datasets in the bucket through the MinIO web console.
Dataset #1: Childcare Prices as a Share of Median Family Income¶
Problem Statement¶
The "Childcare Prices as a Share of Median Family Income" dataset contains detailed information on childcare costs across various U.S. states and counties. However, the dataset requires significant preprocessing to address missing values, ensure consistent column naming, and prepare it for efficient storage and querying. Without these preprocessing steps, the dataset would be prone to errors and challenging to analyze effectively, limiting its usability for decision-making and analytics.
Data Processing¶
Handle Missing Values:
- Identify columns with missing values and replace them with zero to maintain numeric consistency for calculations and analytics.
Standardize Column Names:
- Sanitize column names by removing problematic characters (spaces, special symbols) while retaining meaningful information, such as $ for dollar values and % for percentages.
- Apply a consistent naming format (e.g., snake_case) for easier querying and analytics.
Validate Data Consistency:
- Group data by state (Reference_SheetName) and verify the row counts for each category.
- Ensure completeness and accuracy of data across rows and columns.
Prepare Data for Querying:
- Clean numeric columns to eliminate invalid values.
- Create an optimized, structured dataset ready for database insertion.
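The preprocessing steps above can be sketched in pandas on a toy frame; the real pipeline applies the same idea to the full childcare DataFrame (the two columns below are just a stand-in, and the chained string replacements are a simplified version of the `sanitize_column` UDF defined earlier):

```python
import pandas as pd

# Toy frame standing in for the childcare dataset (values are illustrative).
df = pd.DataFrame({
    "State name": ["Alabama", "Alaska"],
    "Infant center-based price: 2018 ($)": [5_400.0, None],
})

# 1) Replace missing numeric values with zero for consistent arithmetic.
numeric_cols = df.select_dtypes("number").columns
df[numeric_cols] = df[numeric_cols].fillna(0)

# 2) Snake-case the column names, keeping '$' as a '_dollar' label.
df.columns = (
    df.columns.str.replace("$", "_dollar", regex=False)
              .str.replace(r"[^\w\s]", "_", regex=True)
              .str.replace(r"\s+", "_", regex=True)
              .str.replace(r"_+", "_", regex=True)
              .str.strip("_")
              .str.lower()
)
# Columns are now 'state_name' and 'infant_center_based_price_2018_dollar'.
```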
Data Storage¶
Cassandra is chosen as the preferred storage solution due to the following advantages:
- High Scalability: Cassandra can handle large datasets efficiently, which is ideal for storing multi-state and multi-county data across the U.S.
- Distributed Data Model: Its distributed architecture ensures quick access and querying, even for geographically distributed analytics.
- Efficient Querying: The schema can be designed with partition keys (e.g., State, County) to enable fast retrieval of state or county-level information.
- Write-Optimized: Cassandra is optimized for write-heavy operations, making it suitable for inserting large volumes of cleaned data.
- Denormalized Data: Cassandra’s table structure aligns well with flattened, denormalized data, where all relevant information can be queried from a single table.
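A hedged sketch of what such a table could look like in CQL. The table and column names are illustrative (derived from the dataset's sanitized headers), not the project's final schema; keeping the DDL as a string makes the cell runnable without a live cluster:

```python
# Illustrative CQL for the childcare table: the partition key (state_name)
# plus clustering key (county_name) supports fast state- and county-level
# lookups, matching the querying pattern described above.
CREATE_TABLE_CQL = """
CREATE TABLE IF NOT EXISTS childcare.prices_by_county (
    state_name text,
    county_name text,
    county_fips_code bigint,
    median_family_income_dollar double,
    infant_center_based_price_2018_dollar double,
    PRIMARY KEY ((state_name), county_name)
);
"""

# Against a live cluster this would be executed with the cassandra-driver:
# from cassandra.cluster import Cluster
# session = Cluster(["cassandra"]).connect()
# session.execute(CREATE_TABLE_CQL)
```

Because all county rows for a state share one partition, a "show me every county in Texas" query is a single-partition read.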
Note: Childcare prices are derived from each state's childcare Market Rate Survey. Prices are median yearly prices for one child at the market rate. School-age prices reflect the school-year arrangement (part day). Childcare prices are based on the 2016-2018 data collection cycle and are presented in 2018 and 2023 real dollars using the CPI-U for child care (day care and preschool in the U.S. city average). Economic and demographic data are obtained from the 2014-2018 American Community Survey to correspond to the 2016-2018 childcare price data.
PART I: Ingest data from MinIO into pySpark dataframe¶
The following script automates the process of downloading an Excel file from a MinIO object storage bucket, reading all its sheets, and consolidating them into a single Spark DataFrame. Each sheet is processed individually, with its name added as a Reference_SheetName column to distinguish records from different sheets. The resulting unified DataFrame is well-suited for further analysis and transformation in Spark.
While an optional step to save this data in Parquet format is included—highlighting its benefits such as efficient storage, faster retrieval, and optimized query performance in big data environments—this project will not utilize Parquet files. However, this approach is commonly applied in professional, work-related scenarios where performance and scalability are critical.
# MinIO Configuration
s3_url = "http://minio:9000"
s3_key = "minio"
s3_secret = "SU2orange!"
s3_bucket = "project"
s3_file_key = "Childcare-Prices-2023.xlsx"
# Initialize MinIO client
s3 = boto3.client('s3',
endpoint_url=s3_url,
aws_access_key_id=s3_key,
aws_secret_access_key=s3_secret)
# Download Excel file locally from MinIO
local_file = "/tmp/Childcare-Prices-2023.xlsx"
s3.download_file(s3_bucket, s3_file_key, local_file)
# Read sheet names using pandas
sheet_names = pd.ExcelFile(local_file).sheet_names
print("Sheets in the Excel file:", sheet_names)
# Initialize Spark Session
spark = SparkSession.builder \
.appName("ExcelSheetsAnalysis") \
.config("spark.sql.debug.maxToStringFields", "100") \
.getOrCreate()
# Process each sheet and add to a unified DataFrame
final_df = None
for sheet in sheet_names:
    try:
        print(f"Loading sheet: {sheet}")
        # Read the sheet using pandas
        pandas_df = pd.read_excel(local_file, sheet_name=sheet)
        # Add the Reference_SheetName column
        pandas_df['Reference_SheetName'] = sheet
        # Convert pandas DataFrame to Spark DataFrame
        spark_df = spark.createDataFrame(pandas_df)
        # Combine DataFrames
        if final_df is None:
            final_df = spark_df
        else:
            final_df = final_df.unionByName(spark_df)
        print(f"Loaded sheet '{sheet}' with {spark_df.count()} rows and {len(spark_df.columns)} columns.")
    except Exception as e:
        print(f"Failed to load sheet '{sheet}': {e}")

# Show the combined DataFrame
if final_df:
    print("Combined DataFrame:")
    #final_df.show()
    final_df.printSchema()
else:
    print("No data was successfully loaded.")
Sheets in the Excel file: ['AllStates', 'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'DistrictOfColumbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'NewHampshire', 'NewJersey', 'NewMexico', 'NewYork', 'NorthCarolina', 'NorthDakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'RhodeIsland', 'SouthCarolina', 'SouthDakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'WestVirginia', 'Wisconsin', 'Wyoming'] Loading sheet: AllStates Loaded sheet 'AllStates' with 3142 rows and 37 columns. Loading sheet: Alabama Loaded sheet 'Alabama' with 67 rows and 37 columns. Loading sheet: Alaska Loaded sheet 'Alaska' with 29 rows and 37 columns. Loading sheet: Arizona Loaded sheet 'Arizona' with 15 rows and 37 columns. Loading sheet: Arkansas Loaded sheet 'Arkansas' with 75 rows and 37 columns. Loading sheet: California Loaded sheet 'California' with 58 rows and 37 columns. Loading sheet: Colorado Loaded sheet 'Colorado' with 64 rows and 37 columns. Loading sheet: Connecticut Loaded sheet 'Connecticut' with 8 rows and 37 columns. Loading sheet: Delaware Loaded sheet 'Delaware' with 3 rows and 37 columns. Loading sheet: DistrictOfColumbia Loaded sheet 'DistrictOfColumbia' with 1 rows and 37 columns. Loading sheet: Florida Loaded sheet 'Florida' with 67 rows and 37 columns. Loading sheet: Georgia Loaded sheet 'Georgia' with 159 rows and 37 columns. Loading sheet: Hawaii Loaded sheet 'Hawaii' with 5 rows and 37 columns. Loading sheet: Idaho Loaded sheet 'Idaho' with 44 rows and 37 columns. Loading sheet: Illinois Loaded sheet 'Illinois' with 102 rows and 37 columns. Loading sheet: Indiana Loaded sheet 'Indiana' with 92 rows and 37 columns. Loading sheet: Iowa Loaded sheet 'Iowa' with 99 rows and 37 columns. 
Loading sheet: Kansas Loaded sheet 'Kansas' with 105 rows and 37 columns. Loading sheet: Kentucky Loaded sheet 'Kentucky' with 120 rows and 37 columns. Loading sheet: Louisiana Loaded sheet 'Louisiana' with 64 rows and 37 columns. Loading sheet: Maine Loaded sheet 'Maine' with 16 rows and 37 columns. Loading sheet: Maryland Loaded sheet 'Maryland' with 24 rows and 37 columns. Loading sheet: Massachusetts Loaded sheet 'Massachusetts' with 14 rows and 37 columns. Loading sheet: Michigan Loaded sheet 'Michigan' with 83 rows and 37 columns. Loading sheet: Minnesota Loaded sheet 'Minnesota' with 87 rows and 37 columns. Loading sheet: Mississippi Loaded sheet 'Mississippi' with 82 rows and 37 columns. Loading sheet: Missouri Loaded sheet 'Missouri' with 115 rows and 37 columns. Loading sheet: Montana Loaded sheet 'Montana' with 56 rows and 37 columns. Loading sheet: Nebraska Loaded sheet 'Nebraska' with 93 rows and 37 columns. Loading sheet: Nevada Loaded sheet 'Nevada' with 17 rows and 37 columns. Loading sheet: NewHampshire Loaded sheet 'NewHampshire' with 10 rows and 37 columns. Loading sheet: NewJersey Loaded sheet 'NewJersey' with 21 rows and 37 columns. Loading sheet: NewMexico Loaded sheet 'NewMexico' with 33 rows and 37 columns. Loading sheet: NewYork Loaded sheet 'NewYork' with 62 rows and 37 columns. Loading sheet: NorthCarolina Loaded sheet 'NorthCarolina' with 100 rows and 37 columns. Loading sheet: NorthDakota Loaded sheet 'NorthDakota' with 53 rows and 37 columns. Loading sheet: Ohio Loaded sheet 'Ohio' with 88 rows and 37 columns. Loading sheet: Oklahoma Loaded sheet 'Oklahoma' with 77 rows and 37 columns. Loading sheet: Oregon Loaded sheet 'Oregon' with 36 rows and 37 columns. Loading sheet: Pennsylvania Loaded sheet 'Pennsylvania' with 67 rows and 37 columns. Loading sheet: RhodeIsland Loaded sheet 'RhodeIsland' with 5 rows and 37 columns. Loading sheet: SouthCarolina Loaded sheet 'SouthCarolina' with 46 rows and 37 columns. 
Loading sheet: SouthDakota Loaded sheet 'SouthDakota' with 66 rows and 37 columns. Loading sheet: Tennessee Loaded sheet 'Tennessee' with 95 rows and 37 columns. Loading sheet: Texas Loaded sheet 'Texas' with 254 rows and 37 columns. Loading sheet: Utah Loaded sheet 'Utah' with 29 rows and 37 columns. Loading sheet: Vermont Loaded sheet 'Vermont' with 14 rows and 37 columns. Loading sheet: Virginia Loaded sheet 'Virginia' with 133 rows and 37 columns. Loading sheet: Washington Loaded sheet 'Washington' with 39 rows and 37 columns. Loading sheet: WestVirginia Loaded sheet 'WestVirginia' with 55 rows and 37 columns. Loading sheet: Wisconsin Loaded sheet 'Wisconsin' with 72 rows and 37 columns. Loading sheet: Wyoming Loaded sheet 'Wyoming' with 23 rows and 37 columns. Combined DataFrame: root |-- State name: string (nullable = true) |-- County name: string (nullable = true) |-- County FIPS code: long (nullable = true) |-- Infant center-based price: 2018 ($): double (nullable = true) |-- Infant center-based price: 2023 (estimated) ($): double (nullable = true) |-- Infant center-based price as share of family income: 2018 (%): double (nullable = true) |-- Infant home-based price: 2018 ($): double (nullable = true) |-- Infant home-based price: 2023 (estimated) ($): double (nullable = true) |-- Infant home-based price as share of family income: 2018 (%): double (nullable = true) |-- Toddler center-based price: 2018 ($): double (nullable = true) |-- Toddler center-based price: 2023 (estimated) ($): double (nullable = true) |-- Toddler center-based price as share of family income: 2018 (%): double (nullable = true) |-- Toddler home-based price: 2018 ($): double (nullable = true) |-- Toddler home-based price: 2023 (estimated) ($): double (nullable = true) |-- Toddler home-based price as share of family income: 2018 (%): double (nullable = true) |-- Preschool center-based price: 2018 ($): double (nullable = true) |-- Preschool center-based price: 2023 (estimated) ($): double 
(nullable = true) |-- Preschool center-based price as share of family income: 2018 (%): double (nullable = true) |-- Preschool home-based price: 2018 ($): double (nullable = true) |-- Preschool home-based price: 2023 (estimated) ($): double (nullable = true) |-- Preschool home-based price as share of family income: 2018 (%): double (nullable = true) |-- School-age center-based price: 2018 ($): double (nullable = true) |-- School-age center-based price: 2023 (estimated) ($): double (nullable = true) |-- School-age center-based price as share of family income: 2018 (%): double (nullable = true) |-- School-age home-based price: 2018 ($): double (nullable = true) |-- School-age home-based price: 2023 (estimated) ($): double (nullable = true) |-- School-age home-based price as share of family income: 2018 (%): double (nullable = true) |-- Women's labor force participation rate (%): double (nullable = true) |-- Women's median earnings ($): double (nullable = true) |-- Median family income ($): double (nullable = true) |-- Percent of families in poverty (%): double (nullable = true) |-- Total population: long (nullable = true) |-- Percent Asian (%): double (nullable = true) |-- Percent Black (%): double (nullable = true) |-- Percent Hispanic (of any race) (%): double (nullable = true) |-- Percent White (%): double (nullable = true) |-- Reference_SheetName: string (nullable = true)
# Converting dataframe to Pandas
childcare_df = final_df.toPandas()
PART II: Dataset Preprocessing, Transformation, and Storage¶
This part covers cleaning the data and handling missing values. We'll ensure data quality and integrity are maintained so the analysis remains reliable.
Description: This code performs several essential data exploration and preprocessing tasks on the DataFrames.
Purpose: To gather comprehensive information about the DataFrame's structure, statistical summary, and data quality, which is crucial for effective data analysis and preprocessing.
Get Shape and Unique Values¶
# Check unique values in the Reference_SheetName
print(childcare_df['Reference_SheetName'].unique())
print(" ")
# Group by 'Reference_SheetName' and count rows
row_counts = childcare_df.groupby('Reference_SheetName').size().reset_index(name='Total_Rows')
# Sort the results by 'Total_Rows' in descending order
row_counts = row_counts.sort_values(by='Total_Rows', ascending=False)
# Display the result
print(row_counts)
# Get Data Frame shape
#print(f"DataFrame shape: {childcare_df.shape}")
print(" ")
# Display DataFrame info
childcare_df.info()
['AllStates' 'Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado' 'Connecticut' 'Delaware' 'DistrictOfColumbia' 'Florida' 'Georgia' 'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky' 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota' 'Mississippi' 'Missouri' 'Montana' 'Nebraska' 'Nevada' 'NewHampshire' 'NewJersey' 'NewMexico' 'NewYork' 'NorthCarolina' 'NorthDakota' 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania' 'RhodeIsland' 'SouthCarolina' 'SouthDakota' 'Tennessee' 'Texas' 'Utah' 'Vermont' 'Virginia' 'Washington' 'WestVirginia' 'Wisconsin' 'Wyoming']

| | Reference_SheetName | Total_Rows |
|---|---|---|
| 2 | AllStates | 3142 |
| 44 | Texas | 254 |
| 11 | Georgia | 159 |
| 47 | Virginia | 133 |
| 18 | Kentucky | 120 |
| 26 | Missouri | 115 |
| 17 | Kansas | 105 |
| 14 | Illinois | 102 |
| 34 | NorthCarolina | 100 |
| 16 | Iowa | 99 |
| 43 | Tennessee | 95 |
| 28 | Nebraska | 93 |
| 15 | Indiana | 92 |
| 36 | Ohio | 88 |
| 24 | Minnesota | 87 |
| 23 | Michigan | 83 |
| 25 | Mississippi | 82 |
| 37 | Oklahoma | 77 |
| 4 | Arkansas | 75 |
| 50 | Wisconsin | 72 |
| 39 | Pennsylvania | 67 |
| 0 | Alabama | 67 |
| 10 | Florida | 67 |
| 42 | SouthDakota | 66 |
| 6 | Colorado | 64 |
| 19 | Louisiana | 64 |
| 33 | NewYork | 62 |
| 5 | California | 58 |
| 27 | Montana | 56 |
| 49 | WestVirginia | 55 |
| 35 | NorthDakota | 53 |
| 41 | SouthCarolina | 46 |
| 13 | Idaho | 44 |
| 48 | Washington | 39 |
| 38 | Oregon | 36 |
| 32 | NewMexico | 33 |
| 1 | Alaska | 29 |
| 45 | Utah | 29 |
| 21 | Maryland | 24 |
| 51 | Wyoming | 23 |
| 31 | NewJersey | 21 |
| 29 | Nevada | 17 |
| 20 | Maine | 16 |
| 3 | Arizona | 15 |
| 46 | Vermont | 14 |
| 22 | Massachusetts | 14 |
| 30 | NewHampshire | 10 |
| 7 | Connecticut | 8 |
| 40 | RhodeIsland | 5 |
| 12 | Hawaii | 5 |
| 8 | Delaware | 3 |
| 9 | DistrictOfColumbia | 1 |

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6284 entries, 0 to 6283
Data columns (total 37 columns):
 #   Column                                                             Non-Null Count  Dtype
---  ------                                                             --------------  -----
 0   State name                                                         6284 non-null   object
 1   County name                                                        6284 non-null   object
 2   County FIPS code                                                   6284 non-null   int64
 3   Infant center-based price: 2018 ($)                                5636 non-null   float64
 4   Infant center-based price: 2023 (estimated) ($)                    5636 non-null   float64
 5   Infant center-based price as share of family income: 2018 (%)      5636 non-null   float64
 6   Infant home-based price: 2018 ($)                                  5596 non-null   float64
 7   Infant home-based price: 2023 (estimated) ($)                      5596 non-null   float64
 8   Infant home-based price as share of family income: 2018 (%)        5596 non-null   float64
 9   Toddler center-based price: 2018 ($)                               5636 non-null   float64
 10  Toddler center-based price: 2023 (estimated) ($)                   5636 non-null   float64
 11  Toddler center-based price as share of family income: 2018 (%)     5636 non-null   float64
 12  Toddler home-based price: 2018 ($)                                 5596 non-null   float64
 13  Toddler home-based price: 2023 (estimated) ($)                     5596 non-null   float64
 14  Toddler home-based price as share of family income: 2018 (%)       5596 non-null   float64
 15  Preschool center-based price: 2018 ($)                             5636 non-null   float64
 16  Preschool center-based price: 2023 (estimated) ($)                 5636 non-null   float64
 17  Preschool center-based price as share of family income: 2018 (%)   5636 non-null   float64
 18  Preschool home-based price: 2018 ($)                               5596 non-null   float64
 19  Preschool home-based price: 2023 (estimated) ($)                   5596 non-null   float64
 20  Preschool home-based price as share of family income: 2018 (%)     5596 non-null   float64
 21  School-age center-based price: 2018 ($)                            5636 non-null   float64
 22  School-age center-based price: 2023 (estimated) ($)                5636 non-null   float64
 23  School-age center-based price as share of family income: 2018 (%)  5636 non-null   float64
 24  School-age home-based price: 2018 ($)                              5596 non-null   float64
 25  School-age home-based price: 2023 (estimated) ($)                  5596 non-null   float64
 26  School-age home-based price as share of family income: 2018 (%)    5596 non-null   float64
 27  Women's labor force participation rate (%)                         6284 non-null   float64
 28  Women's median earnings ($)                                        6284 non-null   float64
 29  Median family income ($)                                           6284 non-null   float64
 30  Percent of families in poverty (%)                                 6284 non-null   float64
 31  Total population                                                   6284 non-null   int64
 32  Percent Asian (%)                                                  6284 non-null   float64
 33  Percent Black (%)                                                  6284 non-null   float64
 34  Percent Hispanic (of any race) (%)                                 6284 non-null   float64
 35  Percent White (%)                                                  6284 non-null   float64
 36  Reference_SheetName                                                6284 non-null   object
dtypes: float64(32), int64(2), object(3)
memory usage: 1.8+ MB
Get Summary Statistics¶
# Get summary statistics for the DataFrame
print("\nSummary Statistics:")
childcare_df.describe()
Summary Statistics:
| County FIPS code | Infant center-based price: 2018 ($) | Infant center-based price: 2023 (estimated) ($) | Infant center-based price as share of family income: 2018 (%) | Infant home-based price: 2018 ($) | Infant home-based price: 2023 (estimated) ($) | Infant home-based price as share of family income: 2018 (%) | Toddler center-based price: 2018 ($) | Toddler center-based price: 2023 (estimated) ($) | Toddler center-based price as share of family income: 2018 (%) | Toddler home-based price: 2018 ($) | Toddler home-based price: 2023 (estimated) ($) | Toddler home-based price as share of family income: 2018 (%) | Preschool center-based price: 2018 ($) | Preschool center-based price: 2023 (estimated) ($) | Preschool center-based price as share of family income: 2018 (%) | Preschool home-based price: 2018 ($) | Preschool home-based price: 2023 (estimated) ($) | Preschool home-based price as share of family income: 2018 (%) | School-age center-based price: 2018 ($) | School-age center-based price: 2023 (estimated) ($) | School-age center-based price as share of family income: 2018 (%) | School-age home-based price: 2018 ($) | School-age home-based price: 2023 (estimated) ($) | School-age home-based price as share of family income: 2018 (%) | Women's labor force participation rate (%) | Women's median earnings ($) | Median family income ($) | Percent of families in poverty (%) | Total population | Percent Asian (%) | Percent Black (%) | Percent Hispanic (of any race) (%) | Percent White (%) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6284.000000 | 5636.000000 | 5636.000000 | 5636.000000 | 5596.000000 | 5596.000000 | 5596.000000 | 5636.000000 | 5636.000000 | 5636.000000 | 5596.000000 | 5596.000000 | 5596.000000 | 5636.000000 | 5636.000000 | 5636.000000 | 5596.000000 | 5596.000000 | 5596.000000 | 5636.000000 | 5636.000000 | 5636.000000 | 5596.000000 | 5596.000000 | 5596.000000 | 6284.000000 | 6284.000000 | 6284.000000 | 6284.000000 | 6.284000e+03 | 6284.000000 | 6284.000000 | 6284.000000 | 6284.000000 |
| mean | 30383.649268 | 8241.184469 | 9700.303909 | 12.910712 | 6390.337971 | 7521.761058 | 10.145962 | 7454.270260 | 8774.064847 | 11.723498 | 6042.372316 | 7112.187334 | 9.608443 | 6943.904437 | 8173.337657 | 10.950372 | 5886.502211 | 6928.720092 | 9.358120 | 5817.488021 | 6847.486791 | 9.219063 | 5252.829916 | 6182.854754 | 8.384417 | 69.505952 | 24485.186139 | 64062.194213 | 11.238447 | 1.027699e+05 | 1.373456 | 9.059771 | 9.262826 | 83.055761 |
| std | 15161.301696 | 2880.559603 | 3390.568878 | 3.209779 | 1828.062093 | 2151.724419 | 2.322280 | 2363.327871 | 2781.760155 | 2.655799 | 1654.629996 | 1947.585797 | 2.132269 | 2121.796762 | 2497.465443 | 2.455602 | 1616.341033 | 1902.517690 | 2.094380 | 1961.018409 | 2308.220937 | 2.576315 | 1553.794978 | 1828.897722 | 2.203071 | 7.812962 | 4793.042660 | 15863.117635 | 5.664165 | 3.298815e+05 | 2.814414 | 14.529875 | 13.789355 | 16.869443 |
| min | 1001.000000 | 1441.960000 | 1697.262120 | 2.422200 | 2478.320000 | 2917.111888 | 4.608798 | 1120.080000 | 1318.392574 | 1.881507 | 2442.440000 | 2874.879257 | 4.228521 | 1120.080000 | 1318.392574 | 1.881507 | 2081.560000 | 2450.104676 | 3.797497 | 986.960000 | 1161.703391 | 1.657893 | 1721.200000 | 2025.942163 | 2.723350 | 33.600000 | 7336.000000 | 21816.000000 | 0.000000 | 7.500000e+01 | 0.000000 | 0.000000 | 0.000000 | 3.900000 |
| 25% | 18177.000000 | 6487.000000 | 7635.537306 | 10.810312 | 5105.333220 | 6009.243450 | 8.627784 | 5982.599917 | 7041.832103 | 10.001134 | 4827.414260 | 5682.118341 | 8.129308 | 5509.400000 | 6484.851123 | 9.327111 | 4722.640000 | 5558.793572 | 7.947756 | 4622.918118 | 5441.415716 | 7.572226 | 4290.000000 | 5049.553730 | 6.960867 | 64.400000 | 21614.000000 | 53547.000000 | 7.300000 | 1.094800e+04 | 0.300000 | 0.700000 | 2.100000 | 76.600000 |
| 50% | 29176.000000 | 7800.000000 | 9181.006781 | 12.704673 | 5999.760000 | 7062.030416 | 9.939216 | 7163.000000 | 8431.224561 | 11.511692 | 5720.000000 | 6732.738306 | 9.470798 | 6500.000000 | 7650.838984 | 10.777722 | 5634.100688 | 6631.630336 | 9.250211 | 5460.006855 | 6426.712815 | 9.189917 | 4907.256732 | 5776.097094 | 8.261921 | 70.000000 | 23926.000000 | 62128.000000 | 10.300000 | 2.573600e+04 | 0.600000 | 2.300000 | 4.100000 | 89.650000 |
| 75% | 45081.000000 | 9360.000000 | 11017.208137 | 14.608065 | 7331.480000 | 8629.534307 | 11.424713 | 8320.000000 | 9793.073900 | 13.227062 | 6795.880000 | 7999.105175 | 10.795409 | 7852.000000 | 9242.213493 | 12.299005 | 6614.691406 | 7785.836750 | 10.542621 | 6943.560000 | 8172.932237 | 10.844965 | 6240.000000 | 7344.805425 | 9.689398 | 75.100000 | 26582.000000 | 71366.000000 | 13.900000 | 6.725300e+04 | 1.300000 | 10.200000 | 9.600000 | 95.100000 |
| max | 56045.000000 | 24440.000000 | 28767.154581 | 47.003046 | 22408.880000 | 26376.420415 | 23.361513 | 21788.000000 | 25645.612275 | 36.467839 | 19568.640000 | 23033.309812 | 22.539479 | 20020.000000 | 23564.584072 | 33.457612 | 17229.680000 | 20280.231912 | 24.428166 | 19520.800000 | 22976.999638 | 27.322515 | 15060.240000 | 17726.687893 | 19.576390 | 95.200000 | 62919.000000 | 178542.000000 | 52.100000 | 1.009805e+07 | 42.500000 | 87.400000 | 99.100000 | 100.000000 |
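describe() summarizes every numeric column at once; for the affordability question specifically, nlargest gives a quick ranked view of the most burdened counties. A minimal sketch on a toy frame (the rows are hypothetical; only the column name comes from the dataset):

```python
import pandas as pd

# Toy subset (hypothetical counties and values) using the real column name
df = pd.DataFrame({
    "County name": ["A County", "B County", "C County"],
    "Infant center-based price as share of family income: 2018 (%)": [12.9, 47.0, 8.5],
})

# Rank by infant childcare burden and keep the two most expensive counties
top = df.nlargest(2, "Infant center-based price as share of family income: 2018 (%)")
print(top["County name"].tolist())  # ['B County', 'A County']
```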
Check for Null Values¶
# Check for missing values in the DataFrame
print("\nMissing Values:")
childcare_df.isnull().sum()
Missing Values:
State name                                                            0
County name                                                           0
County FIPS code                                                      0
Infant center-based price: 2018 ($)                                 648
Infant center-based price: 2023 (estimated) ($)                     648
Infant center-based price as share of family income: 2018 (%)       648
Infant home-based price: 2018 ($)                                   688
Infant home-based price: 2023 (estimated) ($)                       688
Infant home-based price as share of family income: 2018 (%)         688
Toddler center-based price: 2018 ($)                                648
Toddler center-based price: 2023 (estimated) ($)                    648
Toddler center-based price as share of family income: 2018 (%)      648
Toddler home-based price: 2018 ($)                                  688
Toddler home-based price: 2023 (estimated) ($)                      688
Toddler home-based price as share of family income: 2018 (%)        688
Preschool center-based price: 2018 ($)                              648
Preschool center-based price: 2023 (estimated) ($)                  648
Preschool center-based price as share of family income: 2018 (%)    648
Preschool home-based price: 2018 ($)                                688
Preschool home-based price: 2023 (estimated) ($)                    688
Preschool home-based price as share of family income: 2018 (%)      688
School-age center-based price: 2018 ($)                             648
School-age center-based price: 2023 (estimated) ($)                 648
School-age center-based price as share of family income: 2018 (%)   648
School-age home-based price: 2018 ($)                               688
School-age home-based price: 2023 (estimated) ($)                   688
School-age home-based price as share of family income: 2018 (%)     688
Women's labor force participation rate (%)                            0
Women's median earnings ($)                                           0
Median family income ($)                                              0
Percent of families in poverty (%)                                    0
Total population                                                      0
Percent Asian (%)                                                     0
Percent Black (%)                                                     0
Percent Hispanic (of any race) (%)                                    0
Percent White (%)                                                     0
Reference_SheetName                                                   0
dtype: int64
# Check for missing values in the Pandas DataFrame
missing_values = childcare_df.isnull().sum()
# Filter columns with null values
columns_with_nulls = missing_values[missing_values > 0].index.tolist()
# Check for null values grouped by 'Reference_SheetName'
null_summary = childcare_df.groupby('Reference_SheetName').apply(
    lambda group: group[columns_with_nulls].isnull().sum()
)
# Filter only rows where null values exist
null_summary = null_summary.loc[:, (null_summary > 0).any()]
# Set Pandas display options to show all rows
pd.set_option('display.max_rows', len(null_summary))
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
# Display the DataFrame as a table
display(null_summary)
# Optionally reset display options
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
pd.reset_option('display.width')
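The set/reset pairs above work, but pandas also offers `pd.option_context`, which scopes display settings to a block and restores the previous values automatically on exit. A minimal sketch, using a toy frame in place of `null_summary`:

```python
import pandas as pd

# Toy frame standing in for null_summary (hypothetical data)
demo = pd.DataFrame({"nulls": range(100)})

# Options set inside the context revert automatically, so no
# pd.reset_option calls are needed afterwards.
with pd.option_context("display.max_rows", len(demo),
                       "display.max_columns", None,
                       "display.width", 1000):
    rows_inside = pd.get_option("display.max_rows")
    rendered = demo.to_string()  # display(demo) in a notebook

rows_after = pd.get_option("display.max_rows")  # back to the default
```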
| Infant center-based price: 2018 ($) | Infant center-based price: 2023 (estimated) ($) | Infant center-based price as share of family income: 2018 (%) | Infant home-based price: 2018 ($) | Infant home-based price: 2023 (estimated) ($) | Infant home-based price as share of family income: 2018 (%) | Toddler center-based price: 2018 ($) | Toddler center-based price: 2023 (estimated) ($) | Toddler center-based price as share of family income: 2018 (%) | Toddler home-based price: 2018 ($) | Toddler home-based price: 2023 (estimated) ($) | Toddler home-based price as share of family income: 2018 (%) | Preschool center-based price: 2018 ($) | Preschool center-based price: 2023 (estimated) ($) | Preschool center-based price as share of family income: 2018 (%) | Preschool home-based price: 2018 ($) | Preschool home-based price: 2023 (estimated) ($) | Preschool home-based price as share of family income: 2018 (%) | School-age center-based price: 2018 ($) | School-age center-based price: 2023 (estimated) ($) | School-age center-based price as share of family income: 2018 (%) | School-age home-based price: 2018 ($) | School-age home-based price: 2023 (estimated) ($) | School-age home-based price as share of family income: 2018 (%) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Reference_SheetName | ||||||||||||||||||||||||
| Alabama | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Alaska | 22 | 22 | 22 | 28 | 28 | 28 | 22 | 22 | 22 | 28 | 28 | 28 | 22 | 22 | 22 | 28 | 28 | 28 | 22 | 22 | 22 | 28 | 28 | 28 |
| AllStates | 324 | 324 | 324 | 344 | 344 | 344 | 324 | 324 | 324 | 344 | 344 | 344 | 324 | 324 | 324 | 344 | 344 | 344 | 324 | 324 | 324 | 344 | 344 | 344 |
| Arizona | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Arkansas | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| California | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Colorado | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 |
| Connecticut | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Delaware | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DistrictOfColumbia | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Florida | 2 | 2 | 2 | 3 | 3 | 3 | 2 | 2 | 2 | 3 | 3 | 3 | 2 | 2 | 2 | 3 | 3 | 3 | 2 | 2 | 2 | 3 | 3 | 3 |
| Georgia | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Hawaii | 4 | 4 | 4 | 1 | 1 | 1 | 4 | 4 | 4 | 1 | 1 | 1 | 4 | 4 | 4 | 1 | 1 | 1 | 4 | 4 | 4 | 1 | 1 | 1 |
| Idaho | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Illinois | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Indiana | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 | 92 |
| Iowa | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Kansas | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Kentucky | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Louisiana | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Maine | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Maryland | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Massachusetts | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Michigan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Minnesota | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Mississippi | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Missouri | 100 | 100 | 100 | 110 | 110 | 110 | 100 | 100 | 100 | 110 | 110 | 110 | 100 | 100 | 100 | 110 | 110 | 110 | 100 | 100 | 100 | 110 | 110 | 110 |
| Montana | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Nebraska | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Nevada | 2 | 2 | 2 | 9 | 9 | 9 | 2 | 2 | 2 | 9 | 9 | 9 | 2 | 2 | 2 | 9 | 9 | 9 | 2 | 2 | 2 | 9 | 9 | 9 |
| NewHampshire | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| NewJersey | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| NewMexico | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 |
| NewYork | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| NorthCarolina | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | 2 | 2 |
| NorthDakota | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Ohio | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Oklahoma | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Oregon | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pennsylvania | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| RhodeIsland | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| SouthCarolina | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| SouthDakota | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Tennessee | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Texas | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Utah | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Vermont | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Virginia | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Washington | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| WestVirginia | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Wisconsin | 3 | 3 | 3 | 1 | 1 | 1 | 3 | 3 | 3 | 1 | 1 | 1 | 3 | 3 | 3 | 1 | 1 | 1 | 3 | 3 | 3 | 1 | 1 | 1 |
| Wyoming | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Handling Missing Values¶
This script creates a clean copy of the childcare dataset in which every missing value is replaced with 0, so that nulls in numerical columns cannot disrupt downstream calculations or visualizations. The original DataFrame remains unchanged; a new DataFrame is created for further use.
# Create a new DataFrame to preserve the original data
cleaned_dataframe = childcare_df.copy()
# Replace missing values with 0 for all columns that contain missing values
columns_with_nulls = cleaned_dataframe.columns[cleaned_dataframe.isnull().any()]
# Iterate through columns and replace NaN with 0
# (assign back rather than calling fillna(..., inplace=True) on a column,
# which newer pandas versions deprecate as chained assignment)
for column in columns_with_nulls:
    cleaned_dataframe[column] = cleaned_dataframe[column].fillna(0)
# Display the updated DataFrame and confirm no nulls exist
print("Updated DataFrame with missing values replaced:")
print(cleaned_dataframe.isnull().sum()) # Check if all nulls are replaced
Updated DataFrame with missing values replaced:
State name                                                           0
County name                                                          0
County FIPS code                                                     0
Infant center-based price: 2018 ($)                                  0
Infant center-based price: 2023 (estimated) ($)                      0
Infant center-based price as share of family income: 2018 (%)        0
Infant home-based price: 2018 ($)                                    0
Infant home-based price: 2023 (estimated) ($)                        0
Infant home-based price as share of family income: 2018 (%)          0
Toddler center-based price: 2018 ($)                                 0
Toddler center-based price: 2023 (estimated) ($)                     0
Toddler center-based price as share of family income: 2018 (%)       0
Toddler home-based price: 2018 ($)                                   0
Toddler home-based price: 2023 (estimated) ($)                       0
Toddler home-based price as share of family income: 2018 (%)         0
Preschool center-based price: 2018 ($)                               0
Preschool center-based price: 2023 (estimated) ($)                   0
Preschool center-based price as share of family income: 2018 (%)     0
Preschool home-based price: 2018 ($)                                 0
Preschool home-based price: 2023 (estimated) ($)                     0
Preschool home-based price as share of family income: 2018 (%)       0
School-age center-based price: 2018 ($)                              0
School-age center-based price: 2023 (estimated) ($)                  0
School-age center-based price as share of family income: 2018 (%)    0
School-age home-based price: 2018 ($)                                0
School-age home-based price: 2023 (estimated) ($)                    0
School-age home-based price as share of family income: 2018 (%)      0
Women's labor force participation rate (%)                           0
Women's median earnings ($)                                          0
Median family income ($)                                             0
Percent of families in poverty (%)                                   0
Total population                                                     0
Percent Asian (%)                                                    0
Percent Black (%)                                                    0
Percent Hispanic (of any race) (%)                                   0
Percent White (%)                                                    0
Reference_SheetName                                                  0
dtype: int64
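The per-column loop can also be collapsed into a single call: `DataFrame.fillna` applied to the whole frame replaces every NaN at once. A minimal sketch on a toy frame (column names and values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as childcare_df (hypothetical values)
df = pd.DataFrame({
    "price_2018": [6261.32, np.nan, 4482.40],
    "income": [73530.0, 44339.0, np.nan],
})

# One call replaces every NaN in every column at once
cleaned = df.fillna(0)

print(cleaned.isnull().sum().sum())  # 0
```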
Standardizing Column Names for Consistent and Readable Data¶
Standardizing column names is a crucial preprocessing step: it makes names descriptive, easy to reference in code, and consistent across the dataset. The process removes problematic characters, applies a uniform format, and preserves key information such as dollar signs ($) and percentage signs (%).
Standardized column names:
- Improve programmatic compatibility, making it easier to reference columns in scripts and queries.
- Enhance consistency, reducing potential errors caused by variations in naming conventions.
- Retain semantic meaning, ensuring column names remain informative (e.g., keeping $ and %).
The provided script automates this sanitization process, transforming column names into a uniform, clear format while retaining critical information for financial and percentage-related data.
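`sanitize_dataframe_columns` is defined earlier in the notebook; for reference, here is a minimal sketch that reproduces the sanitized names printed below (the actual implementation may differ in detail):

```python
import re

def sanitize_column_name(name: str) -> str:
    """Spell out ($) and (%), lowercase, and collapse any remaining
    non-alphanumeric runs into single underscores."""
    name = name.replace("($)", "dollar").replace("(%)", "percent")
    name = re.sub(r"[^0-9a-z]+", "_", name.lower())
    return name.strip("_")

def sanitize_dataframe_columns(df):
    """Return a copy of df with every column name sanitized."""
    df = df.copy()
    df.columns = [sanitize_column_name(c) for c in df.columns]
    return df

sanitize_column_name("Infant center-based price: 2018 ($)")
# -> 'infant_center_based_price_2018_dollar'
```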
# Apply the function on the `cleaned_dataframe`
cleaned_dataframe = sanitize_dataframe_columns(cleaned_dataframe)
# Display sanitized column names
print("Sanitized Column Names:")
print(cleaned_dataframe.columns)
Sanitized Column Names:
Index(['state_name', 'county_name', 'county_fips_code',
'infant_center_based_price_2018_dollar',
'infant_center_based_price_2023_estimated_dollar',
'infant_center_based_price_as_share_of_family_income_2018_percent',
'infant_home_based_price_2018_dollar',
'infant_home_based_price_2023_estimated_dollar',
'infant_home_based_price_as_share_of_family_income_2018_percent',
'toddler_center_based_price_2018_dollar',
'toddler_center_based_price_2023_estimated_dollar',
'toddler_center_based_price_as_share_of_family_income_2018_percent',
'toddler_home_based_price_2018_dollar',
'toddler_home_based_price_2023_estimated_dollar',
'toddler_home_based_price_as_share_of_family_income_2018_percent',
'preschool_center_based_price_2018_dollar',
'preschool_center_based_price_2023_estimated_dollar',
'preschool_center_based_price_as_share_of_family_income_2018_percent',
'preschool_home_based_price_2018_dollar',
'preschool_home_based_price_2023_estimated_dollar',
'preschool_home_based_price_as_share_of_family_income_2018_percent',
'school_age_center_based_price_2018_dollar',
'school_age_center_based_price_2023_estimated_dollar',
'school_age_center_based_price_as_share_of_family_income_2018_percent',
'school_age_home_based_price_2018_dollar',
'school_age_home_based_price_2023_estimated_dollar',
'school_age_home_based_price_as_share_of_family_income_2018_percent',
'women_s_labor_force_participation_rate_percent',
'women_s_median_earnings_dollar', 'median_family_income_dollar',
'percent_of_families_in_poverty_percent', 'total_population',
'percent_asian_percent', 'percent_black_percent',
'percent_hispanic_of_any_race_percent', 'percent_white_percent',
'reference_sheetname'],
dtype='object')
print(cleaned_dataframe.dtypes)
# Convert Pandas DataFrame to PySpark DataFrame
spark_df = spark.createDataFrame(cleaned_dataframe)
# Print the schema
spark_df.printSchema()
state_name                                                              object
county_name                                                             object
county_fips_code                                                         int64
infant_center_based_price_2018_dollar                                  float64
infant_center_based_price_2023_estimated_dollar                        float64
infant_center_based_price_as_share_of_family_income_2018_percent       float64
infant_home_based_price_2018_dollar                                    float64
infant_home_based_price_2023_estimated_dollar                          float64
infant_home_based_price_as_share_of_family_income_2018_percent         float64
toddler_center_based_price_2018_dollar                                 float64
toddler_center_based_price_2023_estimated_dollar                       float64
toddler_center_based_price_as_share_of_family_income_2018_percent      float64
toddler_home_based_price_2018_dollar                                   float64
toddler_home_based_price_2023_estimated_dollar                         float64
toddler_home_based_price_as_share_of_family_income_2018_percent        float64
preschool_center_based_price_2018_dollar                               float64
preschool_center_based_price_2023_estimated_dollar                     float64
preschool_center_based_price_as_share_of_family_income_2018_percent    float64
preschool_home_based_price_2018_dollar                                 float64
preschool_home_based_price_2023_estimated_dollar                       float64
preschool_home_based_price_as_share_of_family_income_2018_percent      float64
school_age_center_based_price_2018_dollar                              float64
school_age_center_based_price_2023_estimated_dollar                    float64
school_age_center_based_price_as_share_of_family_income_2018_percent   float64
school_age_home_based_price_2018_dollar                                float64
school_age_home_based_price_2023_estimated_dollar                      float64
school_age_home_based_price_as_share_of_family_income_2018_percent     float64
women_s_labor_force_participation_rate_percent                         float64
women_s_median_earnings_dollar                                         float64
median_family_income_dollar                                            float64
percent_of_families_in_poverty_percent                                 float64
total_population                                                         int64
percent_asian_percent                                                  float64
percent_black_percent                                                  float64
percent_hispanic_of_any_race_percent                                   float64
percent_white_percent                                                  float64
reference_sheetname                                                     object
dtype: object

root
 |-- state_name: string (nullable = true)
 |-- county_name: string (nullable = true)
 |-- county_fips_code: long (nullable = true)
 |-- infant_center_based_price_2018_dollar: double (nullable = true)
 |-- infant_center_based_price_2023_estimated_dollar: double (nullable = true)
 |-- infant_center_based_price_as_share_of_family_income_2018_percent: double (nullable = true)
 |-- infant_home_based_price_2018_dollar: double (nullable = true)
 |-- infant_home_based_price_2023_estimated_dollar: double (nullable = true)
 |-- infant_home_based_price_as_share_of_family_income_2018_percent: double (nullable = true)
 |-- toddler_center_based_price_2018_dollar: double (nullable = true)
 |-- toddler_center_based_price_2023_estimated_dollar: double (nullable = true)
 |-- toddler_center_based_price_as_share_of_family_income_2018_percent: double (nullable = true)
 |-- toddler_home_based_price_2018_dollar: double (nullable = true)
 |-- toddler_home_based_price_2023_estimated_dollar: double (nullable = true)
 |-- toddler_home_based_price_as_share_of_family_income_2018_percent: double (nullable = true)
 |-- preschool_center_based_price_2018_dollar: double (nullable = true)
 |-- preschool_center_based_price_2023_estimated_dollar: double (nullable = true)
 |-- preschool_center_based_price_as_share_of_family_income_2018_percent: double (nullable = true)
 |-- preschool_home_based_price_2018_dollar: double (nullable = true)
 |-- preschool_home_based_price_2023_estimated_dollar: double (nullable = true)
 |-- preschool_home_based_price_as_share_of_family_income_2018_percent: double (nullable = true)
 |-- school_age_center_based_price_2018_dollar: double (nullable = true)
 |-- school_age_center_based_price_2023_estimated_dollar: double (nullable = true)
 |-- school_age_center_based_price_as_share_of_family_income_2018_percent: double (nullable = true)
 |-- school_age_home_based_price_2018_dollar: double (nullable = true)
 |-- school_age_home_based_price_2023_estimated_dollar: double (nullable = true)
 |-- school_age_home_based_price_as_share_of_family_income_2018_percent: double (nullable = true)
 |-- women_s_labor_force_participation_rate_percent: double (nullable = true)
 |-- women_s_median_earnings_dollar: double (nullable = true)
 |-- median_family_income_dollar: double (nullable = true)
 |-- percent_of_families_in_poverty_percent: double (nullable = true)
 |-- total_population: long (nullable = true)
 |-- percent_asian_percent: double (nullable = true)
 |-- percent_black_percent: double (nullable = true)
 |-- percent_hispanic_of_any_race_percent: double (nullable = true)
 |-- percent_white_percent: double (nullable = true)
 |-- reference_sheetname: string (nullable = true)
# Display the DataFrame
display(cleaned_dataframe)
| state_name | county_name | county_fips_code | infant_center_based_price_2018_dollar | infant_center_based_price_2023_estimated_dollar | infant_center_based_price_as_share_of_family_income_2018_percent | infant_home_based_price_2018_dollar | infant_home_based_price_2023_estimated_dollar | infant_home_based_price_as_share_of_family_income_2018_percent | toddler_center_based_price_2018_dollar | ... | women_s_labor_force_participation_rate_percent | women_s_median_earnings_dollar | median_family_income_dollar | percent_of_families_in_poverty_percent | total_population | percent_asian_percent | percent_black_percent | percent_hispanic_of_any_race_percent | percent_white_percent | reference_sheetname | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | Autauga County | 1001 | 6261.32 | 7369.900177 | 8.515327 | 5569.20 | 6555.238842 | 7.574051 | 6261.32 | ... | 66.2 | 26243.0 | 73530.0 | 12.0 | 55200 | 1.0 | 19.1 | 2.8 | 76.9 | AllStates |
| 1 | Alabama | Baldwin County | 1003 | 6261.32 | 7369.900177 | 8.702200 | 5613.92 | 6607.876614 | 7.802421 | 6261.32 | ... | 70.1 | 25308.0 | 71951.0 | 7.3 | 208107 | 0.8 | 9.5 | 4.5 | 86.3 | AllStates |
| 2 | Alabama | Barbour County | 1005 | 4482.40 | 5276.018564 | 10.109385 | 4247.36 | 4999.364226 | 9.579287 | 4482.40 | ... | 60.8 | 20396.0 | 44339.0 | 21.5 | 25782 | 0.4 | 47.6 | 4.3 | 47.4 | AllStates |
| 3 | Alabama | Bibb County | 1007 | 5451.16 | 6416.299606 | 9.940117 | 4529.72 | 5331.716671 | 8.259883 | 5451.16 | ... | 59.5 | 22416.0 | 54840.0 | 10.5 | 22527 | 0.2 | 22.3 | 2.4 | 76.7 | AllStates |
| 4 | Alabama | Blount County | 1009 | 9260.16 | 10899.691251 | 15.464013 | 5660.72 | 6662.962655 | 9.453124 | 9260.16 | ... | 56.3 | 27579.0 | 59882.0 | 10.2 | 57645 | 0.3 | 1.5 | 9.1 | 95.5 | AllStates |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6279 | Wyoming | Sweetwater County | 56037 | 7235.80 | 8516.913957 | 8.524540 | 6885.84 | 8104.992786 | 8.112250 | 6524.44 | ... | 71.2 | 22290.0 | 84882.0 | 10.0 | 44117 | 0.6 | 0.8 | 16.0 | 93.1 | Wyoming |
| 6280 | Wyoming | Teton County | 56039 | 20751.12 | 24425.150440 | 18.763672 | 22408.88 | 26376.420415 | 20.262659 | 19022.64 | ... | 81.2 | 34836.0 | 110592.0 | 2.4 | 23059 | 1.2 | 1.2 | 14.9 | 90.3 | Wyoming |
| 6281 | Wyoming | Uinta County | 56041 | 5625.36 | 6621.342091 | 7.962631 | 5379.40 | 6331.834343 | 7.614478 | 5071.04 | ... | 66.2 | 19981.0 | 70647.0 | 11.0 | 20609 | 0.1 | 0.1 | 9.1 | 93.4 | Wyoming |
| 6282 | Wyoming | Washakie County | 56043 | 5760.04 | 6779.867474 | 8.643388 | 5504.20 | 6478.730452 | 8.259480 | 5196.88 | ... | 70.4 | 22195.0 | 66641.0 | 8.0 | 8129 | 0.0 | 0.0 | 14.2 | 89.7 | Wyoming |
| 6283 | Wyoming | Weston County | 56045 | 6561.36 | 7723.062904 | 8.695495 | 6266.00 | 7375.408781 | 8.304067 | 5924.36 | ... | 67.7 | 24406.0 | 75457.0 | 12.6 | 7100 | 4.4 | 0.5 | 1.5 | 92.5 | Wyoming |
6284 rows × 37 columns
spark_df = spark_df.withColumnRenamed("women_s_labor_force_participation_rate_percent","womens_labor_force_participation_rate_percent")
spark_df = spark_df.withColumnRenamed("women_s_median_earnings_dollar", "womens_median_earnings_dollar")
display(spark_df)
DataFrame[state_name: string, county_name: string, county_fips_code: bigint, infant_center_based_price_2018_dollar: double, infant_center_based_price_2023_estimated_dollar: double, infant_center_based_price_as_share_of_family_income_2018_percent: double, infant_home_based_price_2018_dollar: double, infant_home_based_price_2023_estimated_dollar: double, infant_home_based_price_as_share_of_family_income_2018_percent: double, toddler_center_based_price_2018_dollar: double, toddler_center_based_price_2023_estimated_dollar: double, toddler_center_based_price_as_share_of_family_income_2018_percent: double, toddler_home_based_price_2018_dollar: double, toddler_home_based_price_2023_estimated_dollar: double, toddler_home_based_price_as_share_of_family_income_2018_percent: double, preschool_center_based_price_2018_dollar: double, preschool_center_based_price_2023_estimated_dollar: double, preschool_center_based_price_as_share_of_family_income_2018_percent: double, preschool_home_based_price_2018_dollar: double, preschool_home_based_price_2023_estimated_dollar: double, preschool_home_based_price_as_share_of_family_income_2018_percent: double, school_age_center_based_price_2018_dollar: double, school_age_center_based_price_2023_estimated_dollar: double, school_age_center_based_price_as_share_of_family_income_2018_percent: double, school_age_home_based_price_2018_dollar: double, school_age_home_based_price_2023_estimated_dollar: double, school_age_home_based_price_as_share_of_family_income_2018_percent: double, womens_labor_force_participation_rate_percent: double, womens_median_earnings_dollar: double, median_family_income_dollar: double, percent_of_families_in_poverty_percent: double, total_population: bigint, percent_asian_percent: double, percent_black_percent: double, percent_hispanic_of_any_race_percent: double, percent_white_percent: double, reference_sheetname: string]
Storing data into Cassandra¶
To store the dataset efficiently in Cassandra, a Cassandra environment was initialized using Docker Compose and connected to the database using cqlsh.
- A keyspace named childcare was created with the SimpleStrategy replication strategy and a replication factor of 1, ensuring single-node data replication for simplicity during testing. Durable writes were enabled to guarantee data persistence.
- Within the childcare keyspace, a table named childcare_prices was created with a schema tailored to the dataset. The schema includes columns for state and county information; childcare pricing metrics across age groups and years (both dollar amounts and shares of income); demographic and economic data such as median family income and racial composition; and a reference column that tracks the source sheet of each record.
This structure ensures that the data is well-organized and optimized for queries by geographic and demographic attributes.
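Once the keyspace and table exist, they can be spot-checked from a cqlsh session before any data is written (commands only; output omitted):

```sql
-- Verify the keyspace and table from cqlsh
DESCRIBE KEYSPACES;                      -- should list childcare
USE childcare;
DESCRIBE TABLE childcare_prices;         -- prints the full column list
SELECT COUNT(*) FROM childcare_prices;   -- 0 until the Spark write runs
```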
# CASSANDRA CONFIGURATION
cassandra_host = "cassandra"
# Spark init
spark = SparkSession.builder \
.master("local") \
.appName('jupyter-pyspark') \
.config("spark.cassandra.connection.host", cassandra_host) \
.config("spark.jars.packages","com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.1.0")\
.getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR")
print(cassandra_host)
cassandra
# We need the table to exist before Spark can write to it; create it with the plain Python driver
!pip install -q cassandra-driver
from cassandra.cluster import Cluster

with Cluster([cassandra_host]) as cluster:
    session = cluster.connect()
    session.execute("CREATE KEYSPACE IF NOT EXISTS childcare WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};")
    session.execute("""
CREATE TABLE IF NOT EXISTS childcare.childcare_prices (
state_name text,
county_name text,
county_fips_code int,
infant_center_based_price_2018_dollar double,
infant_center_based_price_2023_estimated_dollar double,
infant_center_based_price_as_share_of_family_income_2018_percent double,
infant_home_based_price_2018_dollar double,
infant_home_based_price_2023_estimated_dollar double,
infant_home_based_price_as_share_of_family_income_2018_percent double,
toddler_center_based_price_2018_dollar double,
toddler_center_based_price_2023_estimated_dollar double,
toddler_center_based_price_as_share_of_family_income_2018_percent double,
toddler_home_based_price_2018_dollar double,
toddler_home_based_price_2023_estimated_dollar double,
toddler_home_based_price_as_share_of_family_income_2018_percent double,
preschool_center_based_price_2018_dollar double,
preschool_center_based_price_2023_estimated_dollar double,
preschool_center_based_price_as_share_of_family_income_2018_percent double,
preschool_home_based_price_2018_dollar double,
preschool_home_based_price_2023_estimated_dollar double,
preschool_home_based_price_as_share_of_family_income_2018_percent double,
school_age_center_based_price_2018_dollar double,
school_age_center_based_price_2023_estimated_dollar double,
school_age_center_based_price_as_share_of_family_income_2018_percent double,
school_age_home_based_price_2018_dollar double,
school_age_home_based_price_2023_estimated_dollar double,
school_age_home_based_price_as_share_of_family_income_2018_percent double,
womens_labor_force_participation_rate_percent double,
womens_median_earnings_dollar double,
median_family_income_dollar double,
percent_of_families_in_poverty_percent double,
total_population int,
percent_asian_percent double,
percent_black_percent double,
percent_hispanic_of_any_race_percent double,
percent_white_percent double,
reference_sheetname text,
PRIMARY KEY (state_name, county_name)
);
""")
# Write back to our newly-minted Cassandra table; Append mode is safe here because Cassandra upserts on the primary key
spark_df.write.format("org.apache.spark.sql.cassandra")\
.mode("Append")\
.option("table", "childcare_prices")\
.option("keyspace","childcare")\
.save()
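The Append-mode note above relies on Cassandra treating every write as an upsert. A toy, plain-Python sketch of that behavior (no Cassandra required; the first income figure comes from the table below, the second is invented for the demonstration):

```python
# Toy model of Cassandra's upsert semantics: rows are keyed by the primary key
# (state_name, county_name); a second write to the same key overwrites the row
# instead of creating a duplicate.
table = {}

def upsert(row):
    key = (row["state_name"], row["county_name"])
    table[key] = row  # insert and update are the same operation

upsert({"state_name": "Tennessee", "county_name": "Anderson County",
        "median_family_income_dollar": 61215.0})
upsert({"state_name": "Tennessee", "county_name": "Anderson County",
        "median_family_income_dollar": 62000.0})  # same key: overwritten

print(len(table))  # 1
print(table[("Tennessee", "Anderson County")]["median_family_income_dollar"])  # 62000.0
```

This is why re-running the write cell does not grow the table: rows with primary keys already present are simply replaced.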
Read back from Cassandra¶
# read back from Cassandra
df1 = spark.read.format("org.apache.spark.sql.cassandra")\
.options(table="childcare_prices", keyspace="childcare") \
.load()
df1.toPandas()
| state_name | county_name | county_fips_code | infant_center_based_price_2018_dollar | infant_center_based_price_2023_estimated_dollar | infant_center_based_price_as_share_of_family_income_2018_percent | infant_home_based_price_2018_dollar | infant_home_based_price_2023_estimated_dollar | infant_home_based_price_as_share_of_family_income_2018_percent | median_family_income_dollar | ... | school_age_home_based_price_as_share_of_family_income_2018_percent | toddler_center_based_price_2018_dollar | toddler_center_based_price_2023_estimated_dollar | toddler_center_based_price_as_share_of_family_income_2018_percent | toddler_home_based_price_2018_dollar | toddler_home_based_price_2023_estimated_dollar | toddler_home_based_price_as_share_of_family_income_2018_percent | total_population | womens_labor_force_participation_rate_percent | womens_median_earnings_dollar | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tennessee | Anderson County | 47001 | 9620.00 | 11323.241697 | 15.715103 | 7020.0 | 8262.906103 | 11.467778 | 61215.0 | ... | 5.521522 | 7676.76 | 9035.946874 | 12.540652 | 6383.0 | 7513.123883 | 10.427183 | 75775 | 66.9 | 23018.0 |
| 1 | Tennessee | Bedford County | 47003 | 6605.04 | 7774.476542 | 12.264260 | 5070.0 | 5967.654408 | 9.413993 | 53856.0 | ... | 6.758764 | 5720.00 | 6732.738306 | 10.620915 | 5070.0 | 5967.654408 | 9.413993 | 47558 | 68.7 | 23171.0 |
| 2 | Tennessee | Benton County | 47005 | 6605.04 | 7774.476542 | 14.095864 | 5070.0 | 5967.654408 | 10.819924 | 46858.0 | ... | 7.768151 | 5720.00 | 6732.738306 | 12.207094 | 5070.0 | 5967.654408 | 10.819924 | 16112 | 59.7 | 17419.0 |
| 3 | Tennessee | Bledsoe County | 47007 | 6605.04 | 7774.476542 | 12.419691 | 5070.0 | 5967.654408 | 9.533301 | 53182.0 | ... | 6.844421 | 5720.00 | 6732.738306 | 10.755519 | 5070.0 | 5967.654408 | 9.533301 | 14602 | 54.8 | 21560.0 |
| 4 | Tennessee | Blount County | 47009 | 9620.00 | 11323.241697 | 14.354353 | 7020.0 | 8262.906103 | 10.474798 | 67018.0 | ... | 5.043421 | 7676.76 | 9035.946874 | 11.454773 | 6383.0 | 7513.123883 | 9.524307 | 128443 | 70.0 | 24574.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3137 | Kansas | Washington County | 20201 | 5876.00 | 6916.358442 | 9.849310 | 5007.6 | 5894.206353 | 8.393704 | 59659.0 | ... | 7.121138 | 6089.20 | 7167.305960 | 10.206675 | 4248.4 | 5000.588360 | 7.121138 | 5525 | 79.9 | 22018.0 |
| 3138 | Kansas | Wichita County | 20203 | 5876.00 | 6916.358442 | 8.596928 | 5007.6 | 5894.206353 | 7.326408 | 68350.0 | ... | 6.215655 | 6089.20 | 7167.305960 | 8.908851 | 4248.4 | 5000.588360 | 6.215655 | 2143 | 74.2 | 27878.0 |
| 3139 | Kansas | Wilson County | 20205 | 5876.00 | 6916.358442 | 10.140826 | 5007.6 | 5894.206353 | 8.642137 | 57944.0 | ... | 7.331907 | 6089.20 | 7167.305960 | 10.508767 | 4248.4 | 5000.588360 | 7.331907 | 8780 | 74.6 | 21404.0 |
| 3140 | Kansas | Woodson County | 20207 | 5876.00 | 6916.358442 | 11.161130 | 5007.6 | 5894.206353 | 9.511653 | 52647.0 | ... | 8.069596 | 6089.20 | 7167.305960 | 11.566091 | 4248.4 | 5000.588360 | 8.069596 | 3170 | 73.2 | 19744.0 |
| 3141 | Kansas | Wyandotte County | 20209 | 7883.20 | 9278.937520 | 14.822503 | 5746.0 | 6763.341662 | 10.804001 | 53184.0 | ... | 9.660048 | 7332.00 | 8630.146374 | 13.786101 | 5137.6 | 6047.223133 | 9.660048 | 164345 | 71.1 | 24739.0 |
3142 rows × 37 columns
# Select a subset of columns from the Cassandra data to index into Elasticsearch
newdf1 = df1.select("state_name", "county_name", "infant_center_based_price_2018_dollar",
"infant_center_based_price_as_share_of_family_income_2018_percent", "infant_home_based_price_2018_dollar",
"infant_home_based_price_as_share_of_family_income_2018_percent", "toddler_center_based_price_2018_dollar",
"toddler_center_based_price_as_share_of_family_income_2018_percent",
"toddler_home_based_price_2018_dollar", "toddler_home_based_price_as_share_of_family_income_2018_percent",
"preschool_center_based_price_2018_dollar", "preschool_center_based_price_as_share_of_family_income_2018_percent",
"preschool_home_based_price_2018_dollar", "preschool_home_based_price_as_share_of_family_income_2018_percent",
"school_age_center_based_price_2018_dollar", "school_age_center_based_price_as_share_of_family_income_2018_percent",
"school_age_home_based_price_2018_dollar", "school_age_home_based_price_as_share_of_family_income_2018_percent",
"total_population", "womens_labor_force_participation_rate_percent", "womens_median_earnings_dollar",
"median_family_income_dollar", "percent_of_families_in_poverty_percent", "percent_asian_percent",
"percent_black_percent", "percent_hispanic_of_any_race_percent", "percent_white_percent")
# Display the new DataFrame
newdf1.toPandas()
| state_name | county_name | infant_center_based_price_2018_dollar | infant_center_based_price_as_share_of_family_income_2018_percent | infant_home_based_price_2018_dollar | infant_home_based_price_as_share_of_family_income_2018_percent | toddler_center_based_price_2018_dollar | toddler_center_based_price_as_share_of_family_income_2018_percent | toddler_home_based_price_2018_dollar | toddler_home_based_price_as_share_of_family_income_2018_percent | ... | school_age_home_based_price_as_share_of_family_income_2018_percent | total_population | womens_labor_force_participation_rate_percent | womens_median_earnings_dollar | median_family_income_dollar | percent_of_families_in_poverty_percent | percent_asian_percent | percent_black_percent | percent_hispanic_of_any_race_percent | percent_white_percent | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Oklahoma | Adair County | 6583.72 | 15.911545 | 4958.72 | 11.984243 | 5535.40 | 13.377964 | 4815.72 | 11.638640 | ... | 8.954250 | 22113 | 58.3 | 23033.0 | 41377.0 | 23.2 | 0.7 | 0.3 | 6.6 | 42.3 |
| 1 | Oklahoma | Alfalfa County | 7929.48 | 11.652261 | 5969.08 | 8.771480 | 6665.36 | 9.794654 | 5796.96 | 8.518552 | ... | 6.550146 | 5857 | 68.8 | 22112.0 | 68051.0 | 5.4 | 0.0 | 3.3 | 5.3 | 77.6 |
| 2 | Oklahoma | Atoka County | 6705.40 | 14.399467 | 5053.36 | 10.851805 | 5639.92 | 12.111409 | 4907.76 | 10.539137 | ... | 8.117079 | 13874 | 61.1 | 23214.0 | 46567.0 | 16.0 | 0.6 | 3.9 | 3.5 | 72.8 |
| 3 | Oklahoma | Beaver County | 7760.48 | 13.160717 | 5842.72 | 9.908457 | 6523.92 | 11.063680 | 5674.24 | 9.622738 | ... | 7.399596 | 5415 | 62.7 | 21181.0 | 58967.0 | 9.2 | 0.0 | 0.3 | 23.0 | 80.3 |
| 4 | Oklahoma | Beckham County | 7985.64 | 12.439660 | 6014.32 | 9.368829 | 6714.24 | 10.459132 | 5840.64 | 9.098279 | ... | 7.000296 | 22621 | 61.6 | 22872.0 | 64195.0 | 7.8 | 0.8 | 3.8 | 14.5 | 83.8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3137 | Kansas | Washington County | 5876.00 | 9.849310 | 5007.60 | 8.393704 | 6089.20 | 10.206675 | 4248.40 | 7.121138 | ... | 7.121138 | 5525 | 79.9 | 22018.0 | 59659.0 | 7.5 | 0.0 | 0.3 | 3.4 | 96.5 |
| 3138 | Kansas | Wichita County | 5876.00 | 8.596928 | 5007.60 | 7.326408 | 6089.20 | 8.908851 | 4248.40 | 6.215655 | ... | 6.215655 | 2143 | 74.2 | 27878.0 | 68350.0 | 3.1 | 0.9 | 0.9 | 29.2 | 90.1 |
| 3139 | Kansas | Wilson County | 5876.00 | 10.140826 | 5007.60 | 8.642137 | 6089.20 | 10.508767 | 4248.40 | 7.331907 | ... | 7.331907 | 8780 | 74.6 | 21404.0 | 57944.0 | 9.3 | 0.0 | 0.6 | 3.0 | 94.5 |
| 3140 | Kansas | Woodson County | 5876.00 | 11.161130 | 5007.60 | 9.511653 | 6089.20 | 11.566091 | 4248.40 | 8.069596 | ... | 8.069596 | 3170 | 73.2 | 19744.0 | 52647.0 | 13.8 | 0.0 | 1.1 | 0.9 | 93.7 |
| 3141 | Kansas | Wyandotte County | 7883.20 | 14.822503 | 5746.00 | 10.804001 | 7332.00 | 13.786101 | 5137.60 | 9.660048 | ... | 9.660048 | 164345 | 71.1 | 24739.0 | 53184.0 | 16.1 | 4.4 | 22.5 | 28.4 | 59.8 |
3142 rows × 27 columns
And save locally¶
# NOTE: Spark writes this path as a directory of JSON part-files, not a single file
newdf1.write.format("json").mode("overwrite").save("/home/jovyan/Walker-Labs/Project/datasets/childcare.json")
Dataset #2: Labor Force Statistics: Employment and Earnings¶
Problem Statement¶
The dataset contains hierarchical information on employment statistics categorized by age, sex, race, ethnicity, and various employment metrics. The data is unstructured and difficult to query efficiently due to its nested relationships and mixed levels of granularity. Furthermore, the dataset includes unnecessary rows, columns, and some missing values that must be cleaned before storage.
Data Processing¶
Remove Unnecessary Columns and Rows:
- Drop rows containing irrelevant or redundant information, such as totals or footnotes.
- Eliminate columns with no meaningful data or those outside the scope of the analysis.
Handle Missing Values:
- Remove rows with missing values in key attributes such as Age, Sex, Race, Ethnicity.
- Ensure remaining data is complete for meaningful analysis.
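A minimal plain-Python sketch of this filtering step (the sample rows are illustrative; the notebook itself applies pandas' `dropna` to the real DataFrame later):

```python
# Keep only rows whose key attribute is present; separator rows in the BLS
# sheet carry no category label and are dropped.
rows = [
    {"age_sex_race_ethnicity": "Men, 16 years and over",
     "employed_full_time_workers_total": 72935.0},
    {"age_sex_race_ethnicity": None,           # blank separator row
     "employed_full_time_workers_total": None},
    {"age_sex_race_ethnicity": "Women, 16 years and over",
     "employed_full_time_workers_total": 55638.0},
]

clean = [r for r in rows if r["age_sex_race_ethnicity"] is not None]
print(len(clean))  # 2
```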
Categorize Data¶
- Organize the dataset into hierarchical categories:
- Main Categories: Summary Totals, White, Black or African American, Asian, Hispanic or Latino ethnicity.
- Subcategories: Men, Women.
- Subcollections: Years (e.g., 16 to 19, 20 to 24).
- Create a nested structure to group related metrics for each demographic.
Transform into JSON¶
- Convert the cleaned and categorized dataset into a JSON format, preserving its hierarchical structure.
Data Storage¶
MongoDB is chosen as the storage solution because:
- Support for Hierarchical Data: MongoDB’s flexible schema supports nested structures, allowing the data’s inherent hierarchy to be retained without flattening.
- Scalable and Efficient Querying: MongoDB enables ad-hoc queries and analysis of hierarchical data with minimal overhead, making it ideal for dynamic insights.
- Flexibility: The schema-less design ensures future modifications, such as adding new categories or metrics, are seamless.
- Better Readability: The JSON-like structure makes it easy to visualize relationships between categories and subcategories, aiding downstream analysis.
This approach ensures that the data is both clean and well-organized, allowing for efficient storage and retrieval in MongoDB while preserving the analytical flexibility required for hierarchical datasets.
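As a sketch of the target nested shape (a hypothetical document layout, not the exact schema used later; the figures are taken from the White rows of the BLS table below):

```python
import json

# Hypothetical nested document: main category -> sex subcategory -> age-bracket
# subcollections, mirroring the hierarchy described above.
doc = {
    "category": "White",
    "subcategories": {
        "Men": {
            "16 to 19 years": {"employed_full_time_workers_total": 702.0},
            "20 to 24 years": {"employed_full_time_workers_total": 4111.0},
        },
        "Women": {
            "16 to 19 years": {"employed_full_time_workers_total": 457.0},
        },
    },
}

# Round-tripping through JSON preserves the hierarchy for MongoDB insertion.
as_json = json.dumps(doc)
print(json.loads(as_json)["subcategories"]["Men"]["20 to 24 years"])
# {'employed_full_time_workers_total': 4111.0}
```

Because MongoDB stores documents in this JSON-like (BSON) form natively, the hierarchy survives storage without flattening.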
# Bureau of Labor Statistics' 2018 data on earnings by age, race, & sex
import boto3
import pandas as pd

# MinIO Configuration
s3_url = "http://minio:9000"
s3_key = "minio"
s3_secret = "SU2orange!"
s3_bucket = "project"
s3_file_key = "cpsaat08.xlsx"
# Initialize MinIO client
s3 = boto3.client('s3',
endpoint_url=s3_url,
aws_access_key_id=s3_key,
aws_secret_access_key=s3_secret)
# Download Excel file locally from MinIO
local_file = "/tmp/cpsaat08.xlsx"
s3.download_file(s3_bucket, s3_file_key, local_file)
# Read the Excel file using pandas
# Specify the sheet name and skip rows if needed
bls_pd = pd.read_excel(local_file, sheet_name="cpsaat08", skiprows=9)
# Adjust Pandas display options to show all rows
pd.set_option('display.max_rows', None) # Show all rows
pd.set_option('display.max_columns', None) # Show all columns
pd.set_option('display.width', None) # Expand display width to avoid line wrapping
# Display the DataFrame
display(bls_pd)
# Optionally, reset Pandas options to defaults after displaying
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
pd.reset_option('display.width')
| TOTAL | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Total, 16 years and over | 128572.0 | 115402.0 | 9357.0 | 3814.0 | 27189.0 | 3564.0 | 22072.0 | 1553.0 | 5052.0 | 1262.0 |
| 1 | 16 to 19 years | 1478.0 | 1273.0 | 173.0 | 33.0 | 3648.0 | 243.0 | 3250.0 | 155.0 | 355.0 | 404.0 |
| 2 | 16 to 17 years | 231.0 | 192.0 | 33.0 | 5.0 | 1607.0 | 31.0 | 1501.0 | 75.0 | 70.0 | 225.0 |
| 3 | 18 to 19 years | 1247.0 | 1081.0 | 139.0 | 27.0 | 2041.0 | 212.0 | 1749.0 | 80.0 | 285.0 | 179.0 |
| 4 | 20 years and over | 127095.0 | 114129.0 | 9184.0 | 3781.0 | 23540.0 | 3320.0 | 18822.0 | 1398.0 | 4697.0 | 858.0 |
| 5 | 20 to 24 years | 9543.0 | 8629.0 | 722.0 | 192.0 | 4508.0 | 713.0 | 3632.0 | 163.0 | 856.0 | 192.0 |
| 6 | 25 years and over | 117552.0 | 105500.0 | 8462.0 | 3589.0 | 19032.0 | 2608.0 | 15190.0 | 1235.0 | 3841.0 | 666.0 |
| 7 | 25 to 54 years | 88993.0 | 80300.0 | 6156.0 | 2536.0 | 11321.0 | 2022.0 | 8640.0 | 659.0 | 2991.0 | 399.0 |
| 8 | 55 years and over | 28559.0 | 25200.0 | 2306.0 | 1053.0 | 7711.0 | 585.0 | 6549.0 | 577.0 | 850.0 | 266.0 |
| 9 | Men, 16 years and over | 72935.0 | 66314.0 | 4774.0 | 1847.0 | 9764.0 | 1650.0 | 7587.0 | 527.0 | 2877.0 | 521.0 |
| 10 | 16 to 19 years | 874.0 | 758.0 | 96.0 | 20.0 | 1613.0 | 125.0 | 1425.0 | 64.0 | 219.0 | 203.0 |
| 11 | 20 years and over | 72061.0 | 65556.0 | 4678.0 | 1828.0 | 8150.0 | 1525.0 | 6162.0 | 463.0 | 2658.0 | 318.0 |
| 12 | 20 to 24 years | 5261.0 | 4815.0 | 367.0 | 79.0 | 1929.0 | 352.0 | 1515.0 | 62.0 | 512.0 | 84.0 |
| 13 | 25 years and over | 66800.0 | 60740.0 | 4310.0 | 1749.0 | 6222.0 | 1174.0 | 4646.0 | 402.0 | 2146.0 | 234.0 |
| 14 | 25 to 54 years | 50472.0 | 46166.0 | 3125.0 | 1181.0 | 3201.0 | 918.0 | 2123.0 | 160.0 | 1655.0 | 113.0 |
| 15 | 55 years and over | 16328.0 | 14575.0 | 1185.0 | 568.0 | 3020.0 | 255.0 | 2524.0 | 241.0 | 491.0 | 121.0 |
| 16 | Women, 16 years and over | 55638.0 | 49088.0 | 4583.0 | 1966.0 | 17425.0 | 1914.0 | 14485.0 | 1026.0 | 2175.0 | 741.0 |
| 17 | 16 to 19 years | 604.0 | 515.0 | 76.0 | 13.0 | 2035.0 | 119.0 | 1825.0 | 91.0 | 136.0 | 201.0 |
| 18 | 20 years and over | 55033.0 | 48574.0 | 4506.0 | 1953.0 | 15390.0 | 1795.0 | 12660.0 | 935.0 | 2039.0 | 540.0 |
| 19 | 20 to 24 years | 4281.0 | 3814.0 | 355.0 | 113.0 | 2579.0 | 361.0 | 2117.0 | 101.0 | 344.0 | 108.0 |
| 20 | 25 years and over | 50752.0 | 44760.0 | 4152.0 | 1840.0 | 12811.0 | 1434.0 | 10543.0 | 834.0 | 1695.0 | 432.0 |
| 21 | 25 to 54 years | 38521.0 | 34135.0 | 3031.0 | 1355.0 | 8120.0 | 1104.0 | 6518.0 | 498.0 | 1336.0 | 287.0 |
| 22 | 55 years and over | 12231.0 | 10625.0 | 1121.0 | 485.0 | 4691.0 | 330.0 | 4025.0 | 335.0 | 358.0 | 145.0 |
| 23 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24 | White | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25 | Men, 16 years and over | 58015.0 | 52678.0 | 3849.0 | 1487.0 | 7687.0 | 1183.0 | 6083.0 | 421.0 | 1998.0 | 383.0 |
| 26 | 16 to 19 years | 702.0 | 610.0 | 74.0 | 18.0 | 1281.0 | 90.0 | 1139.0 | 53.0 | 139.0 | 147.0 |
| 27 | 20 years and over | 57312.0 | 52068.0 | 3775.0 | 1470.0 | 6406.0 | 1093.0 | 4944.0 | 369.0 | 1858.0 | 235.0 |
| 28 | 20 to 24 years | 4111.0 | 3763.0 | 285.0 | 63.0 | 1450.0 | 251.0 | 1153.0 | 45.0 | 332.0 | 58.0 |
| 29 | 25 years and over | 53201.0 | 48305.0 | 3489.0 | 1406.0 | 4957.0 | 842.0 | 3791.0 | 323.0 | 1526.0 | 177.0 |
| 30 | 25 to 54 years | 39476.0 | 36068.0 | 2480.0 | 928.0 | 2343.0 | 644.0 | 1587.0 | 112.0 | 1140.0 | 75.0 |
| 31 | 55 years and over | 13726.0 | 12238.0 | 1009.0 | 479.0 | 2613.0 | 198.0 | 2204.0 | 211.0 | 386.0 | 102.0 |
| 32 | Women, 16 years and over | 41825.0 | 36781.0 | 3541.0 | 1504.0 | 13934.0 | 1363.0 | 11740.0 | 831.0 | 1414.0 | 560.0 |
| 33 | 16 to 19 years | 457.0 | 386.0 | 62.0 | 9.0 | 1620.0 | 92.0 | 1453.0 | 74.0 | 85.0 | 145.0 |
| 34 | 20 years and over | 41369.0 | 36395.0 | 3480.0 | 1494.0 | 12313.0 | 1271.0 | 10286.0 | 756.0 | 1329.0 | 415.0 |
| 35 | 20 to 24 years | 3187.0 | 2834.0 | 270.0 | 83.0 | 1950.0 | 256.0 | 1618.0 | 77.0 | 213.0 | 82.0 |
| 36 | 25 years and over | 38182.0 | 33561.0 | 3209.0 | 1412.0 | 10363.0 | 1015.0 | 8669.0 | 680.0 | 1116.0 | 333.0 |
| 37 | 25 to 54 years | 28451.0 | 25121.0 | 2311.0 | 1020.0 | 6338.0 | 768.0 | 5181.0 | 389.0 | 851.0 | 214.0 |
| 38 | 55 years and over | 9731.0 | 8441.0 | 898.0 | 392.0 | 4026.0 | 247.0 | 3488.0 | 291.0 | 265.0 | 119.0 |
| 39 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 40 | Black or African American | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 41 | Men, 16 years and over | 7899.0 | 7239.0 | 463.0 | 197.0 | 1119.0 | 299.0 | 759.0 | 61.0 | 591.0 | 85.0 |
| 42 | 16 to 19 years | 103.0 | 89.0 | 14.0 | 0.0 | 170.0 | 24.0 | 140.0 | 6.0 | 60.0 | 34.0 |
| 43 | 20 years and over | 7797.0 | 7150.0 | 450.0 | 197.0 | 948.0 | 274.0 | 619.0 | 55.0 | 531.0 | 51.0 |
| 44 | 20 to 24 years | 659.0 | 610.0 | 42.0 | 7.0 | 254.0 | 64.0 | 180.0 | 10.0 | 118.0 | 14.0 |
| 45 | 25 years and over | 7137.0 | 6540.0 | 407.0 | 190.0 | 695.0 | 211.0 | 439.0 | 45.0 | 412.0 | 37.0 |
| 46 | 25 to 54 years | 5704.0 | 5252.0 | 312.0 | 139.0 | 487.0 | 181.0 | 276.0 | 30.0 | 345.0 | 24.0 |
| 47 | 55 years and over | 1434.0 | 1288.0 | 95.0 | 51.0 | 208.0 | 30.0 | 164.0 | 15.0 | 68.0 | 13.0 |
| 48 | Women, 16 years and over | 8154.0 | 7261.0 | 617.0 | 275.0 | 1920.0 | 365.0 | 1454.0 | 101.0 | 534.0 | 112.0 |
| 49 | 16 to 19 years | 95.0 | 82.0 | 10.0 | 3.0 | 227.0 | 14.0 | 205.0 | 8.0 | 39.0 | 34.0 |
| 50 | 20 years and over | 8058.0 | 7179.0 | 607.0 | 272.0 | 1693.0 | 350.0 | 1250.0 | 93.0 | 495.0 | 79.0 |
| 51 | 20 to 24 years | 667.0 | 595.0 | 51.0 | 22.0 | 360.0 | 73.0 | 272.0 | 15.0 | 86.0 | 15.0 |
| 52 | 25 years and over | 7391.0 | 6584.0 | 557.0 | 250.0 | 1333.0 | 278.0 | 977.0 | 78.0 | 409.0 | 64.0 |
| 53 | 25 to 54 years | 5861.0 | 5243.0 | 423.0 | 194.0 | 964.0 | 226.0 | 681.0 | 58.0 | 344.0 | 47.0 |
| 54 | 55 years and over | 1530.0 | 1341.0 | 133.0 | 56.0 | 368.0 | 52.0 | 296.0 | 21.0 | 64.0 | 17.0 |
| 55 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 56 | Asian | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 57 | Men, 16 years and over | 4661.0 | 4307.0 | 251.0 | 104.0 | 527.0 | 77.0 | 425.0 | 24.0 | 140.0 | 19.0 |
| 58 | 16 to 19 years | 22.0 | 20.0 | 2.0 | 0.0 | 63.0 | 3.0 | 57.0 | 3.0 | 3.0 | 4.0 |
| 59 | 20 years and over | 4639.0 | 4287.0 | 249.0 | 103.0 | 464.0 | 74.0 | 368.0 | 22.0 | 137.0 | 15.0 |
| 60 | 20 to 24 years | 209.0 | 190.0 | 15.0 | 4.0 | 121.0 | 14.0 | 104.0 | 3.0 | 26.0 | 7.0 |
| 61 | 25 years and over | 4430.0 | 4097.0 | 234.0 | 99.0 | 343.0 | 60.0 | 264.0 | 19.0 | 112.0 | 8.0 |
| 62 | 25 to 54 years | 3575.0 | 3321.0 | 184.0 | 70.0 | 211.0 | 40.0 | 161.0 | 10.0 | 87.0 | 5.0 |
| 63 | 55 years and over | 855.0 | 776.0 | 50.0 | 29.0 | 132.0 | 20.0 | 103.0 | 8.0 | 24.0 | 3.0 |
| 64 | Women, 16 years and over | 3733.0 | 3356.0 | 250.0 | 127.0 | 911.0 | 95.0 | 754.0 | 63.0 | 117.0 | 29.0 |
| 65 | 16 to 19 years | 19.0 | 17.0 | 2.0 | 0.0 | 62.0 | 3.0 | 55.0 | 4.0 | 3.0 | 7.0 |
| 66 | 20 years and over | 3714.0 | 3340.0 | 248.0 | 127.0 | 849.0 | 92.0 | 698.0 | 59.0 | 114.0 | 22.0 |
| 67 | 20 to 24 years | 196.0 | 181.0 | 12.0 | 4.0 | 134.0 | 14.0 | 114.0 | 7.0 | 22.0 | 5.0 |
| 68 | 25 years and over | 3518.0 | 3159.0 | 236.0 | 123.0 | 715.0 | 78.0 | 585.0 | 53.0 | 93.0 | 17.0 |
| 69 | 25 to 54 years | 2801.0 | 2524.0 | 179.0 | 98.0 | 518.0 | 61.0 | 422.0 | 35.0 | 75.0 | 12.0 |
| 70 | 55 years and over | 717.0 | 635.0 | 57.0 | 25.0 | 197.0 | 17.0 | 163.0 | 17.0 | 18.0 | 5.0 |
| 71 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 72 | Hispanic or Latino ethnicity | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 73 | Men, 16 years and over | 13719.0 | 12502.0 | 916.0 | 301.0 | 1699.0 | 438.0 | 1207.0 | 54.0 | 601.0 | 94.0 |
| 74 | 16 to 19 years | 230.0 | 201.0 | 25.0 | 4.0 | 315.0 | 29.0 | 278.0 | 8.0 | 57.0 | 47.0 |
| 75 | 20 years and over | 13489.0 | 12301.0 | 891.0 | 298.0 | 1384.0 | 408.0 | 930.0 | 46.0 | 544.0 | 47.0 |
| 76 | 20 to 24 years | 1276.0 | 1170.0 | 86.0 | 20.0 | 429.0 | 87.0 | 333.0 | 9.0 | 120.0 | 16.0 |
| 77 | 25 years and over | 12213.0 | 11131.0 | 805.0 | 277.0 | 955.0 | 321.0 | 597.0 | 37.0 | 424.0 | 31.0 |
| 78 | 25 to 54 years | 10306.0 | 9410.0 | 676.0 | 221.0 | 712.0 | 268.0 | 420.0 | 24.0 | 336.0 | 20.0 |
| 79 | 55 years and over | 1907.0 | 1721.0 | 129.0 | 57.0 | 243.0 | 53.0 | 177.0 | 13.0 | 88.0 | 11.0 |
| 80 | Women, 16 years and over | 8694.0 | 7707.0 | 691.0 | 296.0 | 2901.0 | 508.0 | 2270.0 | 123.0 | 465.0 | 163.0 |
| 81 | 16 to 19 years | 140.0 | 120.0 | 17.0 | 2.0 | 409.0 | 33.0 | 361.0 | 15.0 | 33.0 | 49.0 |
| 82 | 20 years and over | 8554.0 | 7587.0 | 673.0 | 293.0 | 2491.0 | 475.0 | 1909.0 | 107.0 | 432.0 | 115.0 |
| 83 | 20 to 24 years | 890.0 | 784.0 | 78.0 | 28.0 | 566.0 | 96.0 | 454.0 | 16.0 | 76.0 | 27.0 |
| 84 | 25 years and over | 7664.0 | 6803.0 | 595.0 | 265.0 | 1926.0 | 379.0 | 1455.0 | 91.0 | 356.0 | 88.0 |
| 85 | 25 to 54 years | 6468.0 | 5752.0 | 497.0 | 219.0 | 1512.0 | 299.0 | 1143.0 | 70.0 | 299.0 | 71.0 |
| 86 | 55 years and over | 1196.0 | 1051.0 | 99.0 | 47.0 | 414.0 | 80.0 | 312.0 | 22.0 | 57.0 | 17.0 |
| 87 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 88 | Footnotes: | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 89 | (1) Employed persons are classified as full- o... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 90 | (2) Includes some persons at work 35 hours or ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 91 | NOTE: Estimates for the above race groups (Whi... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
PART II: Dataset Preprocessing, Transformation, and Storage¶
This part covers cleaning the data and handling missing values; we ensure data quality and integrity are maintained for reliable analysis.
Description: The code below performs several essential data exploration and preprocessing tasks on the DataFrame.
Purpose: To gather comprehensive information about the DataFrame's structure, statistical summary, and data quality, which is crucial for effective analysis and preprocessing.
Get the column list and check unique values in the TOTAL column¶
print(f"Total number of columns: {len(bls_pd.columns)}")
print(" ")
print("Column names:", bls_pd.columns)
print(" ")
Total number of columns: 11
Column names: Index(['TOTAL', 'Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4',
'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9',
'Unnamed: 10'],
dtype='object')
# Check unique values in the 'TOTAL' column
print(bls_pd['TOTAL'].unique())
print(" ")
# Group by 'TOTAL' and count rows
row_counts = bls_pd.groupby('TOTAL').size().reset_index(name='Total_Rows')
# Sort the results by 'Total_Rows' in descending order
row_counts = row_counts.sort_values(by='Total_Rows', ascending=False)
# Display the result
print(row_counts)
# Get DataFrame shape
#print(f"DataFrame shape: {bls_pd.shape}")
print(" ")
# Display DataFrame info
bls_pd.info()
['Total, 16 years and over' '16 to 19 years' '16 to 17 years'
'18 to 19 years' '20 years and over' '20 to 24 years' '25 years and over'
'25 to 54 years' '55 years and over' 'Men, 16 years and over'
'Women, 16 years and over' nan 'White' 'Black or African American'
'Asian' 'Hispanic or Latino ethnicity' 'Footnotes:'
'(1) Employed persons are classified as full- or part-time workers based on their usual weekly hours at all jobs regardless of the number of hours they are at work during the reference week. Persons absent from work also are classified according to their usual status. Full time is 35 hours or more per week; part time is less than 35 hours.'
'(2) Includes some persons at work 35 hours or more classified by their reason for usually working part time.'
'NOTE: Estimates for the above race groups (White, Black or African American, and Asian) do not sum to totals because data are not presented for all races. Persons whose ethnicity is identified as Hispanic or Latino may be of any race. Updated population controls are introduced annually with the release of January data.']
TOTAL Total_Rows
9 55 years and over 11
7 25 to 54 years 11
3 16 to 19 years 11
8 25 years and over 11
5 20 to 24 years 11
6 20 years and over 11
14 Men, 16 years and over 5
18 Women, 16 years and over 5
4 18 to 19 years 1
1 (2) Includes some persons at work 35 hours or ... 1
10 Asian 1
11 Black or African American 1
12 Footnotes: 1
13 Hispanic or Latino ethnicity 1
2 16 to 17 years 1
15 NOTE: Estimates for the above race groups (Whi... 1
16 Total, 16 years and over 1
17 White 1
0 (1) Employed persons are classified as full- o... 1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92 entries, 0 to 91
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TOTAL 87 non-null object
1 Unnamed: 1 79 non-null float64
2 Unnamed: 2 79 non-null float64
3 Unnamed: 3 79 non-null float64
4 Unnamed: 4 79 non-null float64
5 Unnamed: 5 79 non-null float64
6 Unnamed: 6 79 non-null float64
7 Unnamed: 7 79 non-null float64
8 Unnamed: 8 79 non-null float64
9 Unnamed: 9 79 non-null float64
10 Unnamed: 10 79 non-null float64
dtypes: float64(10), object(1)
memory usage: 8.0+ KB
Standardizing Column Names for Consistent and Readable Data¶
# Rename columns in a Pandas DataFrame
# Mapping from the Excel sheet's unnamed columns to descriptive names
column_mapping = {
"TOTAL": "Age, Sex, Race, Ethnicity",
"Unnamed: 1": "Employed, Full-Time Workers - Total",
"Unnamed: 2": "Employed, Full-Time Workers at Work 35hrs or More",
"Unnamed: 3": "Employed, Full-Time Workers at Work Less Than 35hrs",
"Unnamed: 4": "Employed, Full-Time Workers - Not at Work",
"Unnamed: 5": "Employed, Part-Time Workers - Total",
"Unnamed: 6": "Employed, Part-Time for Economic Reasons",
"Unnamed: 7": "Employed, Part-Time for Non-Economic Reasons",
"Unnamed: 8": "Employed, Part-Time - Not At Work",
"Unnamed: 9": "Unemployed, Looking for Full-Time Work",
"Unnamed: 10": "Unemployed, Looking for Part-Time Work"
}
# Apply mapping and sanitize column names
sanitized_column_mapping = {key: sanitize_column(value) for key, value in column_mapping.items()}
# Empty DataFrame with the original column names, used only to preview the mapping
original_columns = list(column_mapping.keys())
bls_df = pd.DataFrame(columns=original_columns)
# Rename and sanitize columns
bls_df = bls_df.rename(columns=sanitized_column_mapping)
# Display the new column names
print("Sanitized Column Names:")
print(list(bls_df.columns))
Sanitized Column Names: ['age_sex_race_ethnicity', 'employed_full_time_workers_total', 'employed_full_time_workers_at_work_35hrs_or_more', 'employed_full_time_workers_at_work_less_than_35hrs', 'employed_full_time_workers_not_at_work', 'employed_part_time_workers_total', 'employed_part_time_for_economic_reasons', 'employed_part_time_for_non_economic_reasons', 'employed_part_time_not_at_work', 'unemployed_looking_for_full_time_work', 'unemployed_looking_for_part_time_work']
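The `sanitize_column` helper is defined earlier in the notebook; a minimal implementation consistent with the sanitized names shown above would be:

```python
import re

def sanitize_column(name: str) -> str:
    # Collapse runs of non-alphanumeric characters into single underscores,
    # trim stray leading/trailing underscores, and lowercase the result.
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()

print(sanitize_column("Age, Sex, Race, Ethnicity"))
# age_sex_race_ethnicity
```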
# Rename and sanitize columns in the original DataFrame (bls_pd)
bls_pd = bls_pd.rename(columns=sanitized_column_mapping)
# Adjust Pandas display options to show all rows
pd.set_option('display.max_rows', None) # Show all rows
pd.set_option('display.max_columns', None) # Show all columns
pd.set_option('display.width', None) # Expand display width to avoid line wrapping
# Display the DataFrame
#display(bls_pd)
# Optionally, reset Pandas options to defaults after displaying
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
pd.reset_option('display.width')
# Define the indices of rows to remove
rows_to_remove = [87, 88, 89, 90, 91]
# Drop the rows by index
bls_pd = bls_pd.drop(rows_to_remove, axis=0)
# Reset the index (optional, if you want sequential indices)
bls_pd.reset_index(drop=True, inplace=True)
# Display the updated DataFrame
display(bls_pd)
| age_sex_race_ethnicity | employed_full_time_workers_total | employed_full_time_workers_at_work_35hrs_or_more | employed_full_time_workers_at_work_less_than_35hrs | employed_full_time_workers_not_at_work | employed_part_time_workers_total | employed_part_time_for_economic_reasons | employed_part_time_for_non_economic_reasons | employed_part_time_not_at_work | unemployed_looking_for_full_time_work | unemployed_looking_for_part_time_work | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Total, 16 years and over | 128572.0 | 115402.0 | 9357.0 | 3814.0 | 27189.0 | 3564.0 | 22072.0 | 1553.0 | 5052.0 | 1262.0 |
| 1 | 16 to 19 years | 1478.0 | 1273.0 | 173.0 | 33.0 | 3648.0 | 243.0 | 3250.0 | 155.0 | 355.0 | 404.0 |
| 2 | 16 to 17 years | 231.0 | 192.0 | 33.0 | 5.0 | 1607.0 | 31.0 | 1501.0 | 75.0 | 70.0 | 225.0 |
| 3 | 18 to 19 years | 1247.0 | 1081.0 | 139.0 | 27.0 | 2041.0 | 212.0 | 1749.0 | 80.0 | 285.0 | 179.0 |
| 4 | 20 years and over | 127095.0 | 114129.0 | 9184.0 | 3781.0 | 23540.0 | 3320.0 | 18822.0 | 1398.0 | 4697.0 | 858.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 82 | 20 years and over | 8554.0 | 7587.0 | 673.0 | 293.0 | 2491.0 | 475.0 | 1909.0 | 107.0 | 432.0 | 115.0 |
| 83 | 20 to 24 years | 890.0 | 784.0 | 78.0 | 28.0 | 566.0 | 96.0 | 454.0 | 16.0 | 76.0 | 27.0 |
| 84 | 25 years and over | 7664.0 | 6803.0 | 595.0 | 265.0 | 1926.0 | 379.0 | 1455.0 | 91.0 | 356.0 | 88.0 |
| 85 | 25 to 54 years | 6468.0 | 5752.0 | 497.0 | 219.0 | 1512.0 | 299.0 | 1143.0 | 70.0 | 299.0 | 71.0 |
| 86 | 55 years and over | 1196.0 | 1051.0 | 99.0 | 47.0 | 414.0 | 80.0 | 312.0 | 22.0 | 57.0 | 17.0 |
87 rows × 11 columns
# Remove rows where the 'age_sex_race_ethnicity' column has NaN values
bls_pd = bls_pd.dropna(subset=['age_sex_race_ethnicity'])
# Adjust Pandas display options to show all rows
pd.set_option('display.max_rows', None) # Show all rows
pd.set_option('display.max_columns', None) # Show all columns
pd.set_option('display.width', None) # Expand display width to avoid line wrapping
# Display the DataFrame
display(bls_pd)
# Optionally, reset Pandas options to defaults after displaying
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
pd.reset_option('display.width')
| | age_sex_race_ethnicity | employed_full_time_workers_total | employed_full_time_workers_at_work_35hrs_or_more | employed_full_time_workers_at_work_less_than_35hrs | employed_full_time_workers_not_at_work | employed_part_time_workers_total | employed_part_time_for_economic_reasons | employed_part_time_for_non_economic_reasons | employed_part_time_not_at_work | unemployed_looking_for_full_time_work | unemployed_looking_for_part_time_work |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Total, 16 years and over | 128572.0 | 115402.0 | 9357.0 | 3814.0 | 27189.0 | 3564.0 | 22072.0 | 1553.0 | 5052.0 | 1262.0 |
| 1 | 16 to 19 years | 1478.0 | 1273.0 | 173.0 | 33.0 | 3648.0 | 243.0 | 3250.0 | 155.0 | 355.0 | 404.0 |
| 2 | 16 to 17 years | 231.0 | 192.0 | 33.0 | 5.0 | 1607.0 | 31.0 | 1501.0 | 75.0 | 70.0 | 225.0 |
| 3 | 18 to 19 years | 1247.0 | 1081.0 | 139.0 | 27.0 | 2041.0 | 212.0 | 1749.0 | 80.0 | 285.0 | 179.0 |
| 4 | 20 years and over | 127095.0 | 114129.0 | 9184.0 | 3781.0 | 23540.0 | 3320.0 | 18822.0 | 1398.0 | 4697.0 | 858.0 |
| 5 | 20 to 24 years | 9543.0 | 8629.0 | 722.0 | 192.0 | 4508.0 | 713.0 | 3632.0 | 163.0 | 856.0 | 192.0 |
| 6 | 25 years and over | 117552.0 | 105500.0 | 8462.0 | 3589.0 | 19032.0 | 2608.0 | 15190.0 | 1235.0 | 3841.0 | 666.0 |
| 7 | 25 to 54 years | 88993.0 | 80300.0 | 6156.0 | 2536.0 | 11321.0 | 2022.0 | 8640.0 | 659.0 | 2991.0 | 399.0 |
| 8 | 55 years and over | 28559.0 | 25200.0 | 2306.0 | 1053.0 | 7711.0 | 585.0 | 6549.0 | 577.0 | 850.0 | 266.0 |
| 9 | Men, 16 years and over | 72935.0 | 66314.0 | 4774.0 | 1847.0 | 9764.0 | 1650.0 | 7587.0 | 527.0 | 2877.0 | 521.0 |
| 10 | 16 to 19 years | 874.0 | 758.0 | 96.0 | 20.0 | 1613.0 | 125.0 | 1425.0 | 64.0 | 219.0 | 203.0 |
| 11 | 20 years and over | 72061.0 | 65556.0 | 4678.0 | 1828.0 | 8150.0 | 1525.0 | 6162.0 | 463.0 | 2658.0 | 318.0 |
| 12 | 20 to 24 years | 5261.0 | 4815.0 | 367.0 | 79.0 | 1929.0 | 352.0 | 1515.0 | 62.0 | 512.0 | 84.0 |
| 13 | 25 years and over | 66800.0 | 60740.0 | 4310.0 | 1749.0 | 6222.0 | 1174.0 | 4646.0 | 402.0 | 2146.0 | 234.0 |
| 14 | 25 to 54 years | 50472.0 | 46166.0 | 3125.0 | 1181.0 | 3201.0 | 918.0 | 2123.0 | 160.0 | 1655.0 | 113.0 |
| 15 | 55 years and over | 16328.0 | 14575.0 | 1185.0 | 568.0 | 3020.0 | 255.0 | 2524.0 | 241.0 | 491.0 | 121.0 |
| 16 | Women, 16 years and over | 55638.0 | 49088.0 | 4583.0 | 1966.0 | 17425.0 | 1914.0 | 14485.0 | 1026.0 | 2175.0 | 741.0 |
| 17 | 16 to 19 years | 604.0 | 515.0 | 76.0 | 13.0 | 2035.0 | 119.0 | 1825.0 | 91.0 | 136.0 | 201.0 |
| 18 | 20 years and over | 55033.0 | 48574.0 | 4506.0 | 1953.0 | 15390.0 | 1795.0 | 12660.0 | 935.0 | 2039.0 | 540.0 |
| 19 | 20 to 24 years | 4281.0 | 3814.0 | 355.0 | 113.0 | 2579.0 | 361.0 | 2117.0 | 101.0 | 344.0 | 108.0 |
| 20 | 25 years and over | 50752.0 | 44760.0 | 4152.0 | 1840.0 | 12811.0 | 1434.0 | 10543.0 | 834.0 | 1695.0 | 432.0 |
| 21 | 25 to 54 years | 38521.0 | 34135.0 | 3031.0 | 1355.0 | 8120.0 | 1104.0 | 6518.0 | 498.0 | 1336.0 | 287.0 |
| 22 | 55 years and over | 12231.0 | 10625.0 | 1121.0 | 485.0 | 4691.0 | 330.0 | 4025.0 | 335.0 | 358.0 | 145.0 |
| 24 | White | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25 | Men, 16 years and over | 58015.0 | 52678.0 | 3849.0 | 1487.0 | 7687.0 | 1183.0 | 6083.0 | 421.0 | 1998.0 | 383.0 |
| 26 | 16 to 19 years | 702.0 | 610.0 | 74.0 | 18.0 | 1281.0 | 90.0 | 1139.0 | 53.0 | 139.0 | 147.0 |
| 27 | 20 years and over | 57312.0 | 52068.0 | 3775.0 | 1470.0 | 6406.0 | 1093.0 | 4944.0 | 369.0 | 1858.0 | 235.0 |
| 28 | 20 to 24 years | 4111.0 | 3763.0 | 285.0 | 63.0 | 1450.0 | 251.0 | 1153.0 | 45.0 | 332.0 | 58.0 |
| 29 | 25 years and over | 53201.0 | 48305.0 | 3489.0 | 1406.0 | 4957.0 | 842.0 | 3791.0 | 323.0 | 1526.0 | 177.0 |
| 30 | 25 to 54 years | 39476.0 | 36068.0 | 2480.0 | 928.0 | 2343.0 | 644.0 | 1587.0 | 112.0 | 1140.0 | 75.0 |
| 31 | 55 years and over | 13726.0 | 12238.0 | 1009.0 | 479.0 | 2613.0 | 198.0 | 2204.0 | 211.0 | 386.0 | 102.0 |
| 32 | Women, 16 years and over | 41825.0 | 36781.0 | 3541.0 | 1504.0 | 13934.0 | 1363.0 | 11740.0 | 831.0 | 1414.0 | 560.0 |
| 33 | 16 to 19 years | 457.0 | 386.0 | 62.0 | 9.0 | 1620.0 | 92.0 | 1453.0 | 74.0 | 85.0 | 145.0 |
| 34 | 20 years and over | 41369.0 | 36395.0 | 3480.0 | 1494.0 | 12313.0 | 1271.0 | 10286.0 | 756.0 | 1329.0 | 415.0 |
| 35 | 20 to 24 years | 3187.0 | 2834.0 | 270.0 | 83.0 | 1950.0 | 256.0 | 1618.0 | 77.0 | 213.0 | 82.0 |
| 36 | 25 years and over | 38182.0 | 33561.0 | 3209.0 | 1412.0 | 10363.0 | 1015.0 | 8669.0 | 680.0 | 1116.0 | 333.0 |
| 37 | 25 to 54 years | 28451.0 | 25121.0 | 2311.0 | 1020.0 | 6338.0 | 768.0 | 5181.0 | 389.0 | 851.0 | 214.0 |
| 38 | 55 years and over | 9731.0 | 8441.0 | 898.0 | 392.0 | 4026.0 | 247.0 | 3488.0 | 291.0 | 265.0 | 119.0 |
| 40 | Black or African American | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 41 | Men, 16 years and over | 7899.0 | 7239.0 | 463.0 | 197.0 | 1119.0 | 299.0 | 759.0 | 61.0 | 591.0 | 85.0 |
| 42 | 16 to 19 years | 103.0 | 89.0 | 14.0 | 0.0 | 170.0 | 24.0 | 140.0 | 6.0 | 60.0 | 34.0 |
| 43 | 20 years and over | 7797.0 | 7150.0 | 450.0 | 197.0 | 948.0 | 274.0 | 619.0 | 55.0 | 531.0 | 51.0 |
| 44 | 20 to 24 years | 659.0 | 610.0 | 42.0 | 7.0 | 254.0 | 64.0 | 180.0 | 10.0 | 118.0 | 14.0 |
| 45 | 25 years and over | 7137.0 | 6540.0 | 407.0 | 190.0 | 695.0 | 211.0 | 439.0 | 45.0 | 412.0 | 37.0 |
| 46 | 25 to 54 years | 5704.0 | 5252.0 | 312.0 | 139.0 | 487.0 | 181.0 | 276.0 | 30.0 | 345.0 | 24.0 |
| 47 | 55 years and over | 1434.0 | 1288.0 | 95.0 | 51.0 | 208.0 | 30.0 | 164.0 | 15.0 | 68.0 | 13.0 |
| 48 | Women, 16 years and over | 8154.0 | 7261.0 | 617.0 | 275.0 | 1920.0 | 365.0 | 1454.0 | 101.0 | 534.0 | 112.0 |
| 49 | 16 to 19 years | 95.0 | 82.0 | 10.0 | 3.0 | 227.0 | 14.0 | 205.0 | 8.0 | 39.0 | 34.0 |
| 50 | 20 years and over | 8058.0 | 7179.0 | 607.0 | 272.0 | 1693.0 | 350.0 | 1250.0 | 93.0 | 495.0 | 79.0 |
| 51 | 20 to 24 years | 667.0 | 595.0 | 51.0 | 22.0 | 360.0 | 73.0 | 272.0 | 15.0 | 86.0 | 15.0 |
| 52 | 25 years and over | 7391.0 | 6584.0 | 557.0 | 250.0 | 1333.0 | 278.0 | 977.0 | 78.0 | 409.0 | 64.0 |
| 53 | 25 to 54 years | 5861.0 | 5243.0 | 423.0 | 194.0 | 964.0 | 226.0 | 681.0 | 58.0 | 344.0 | 47.0 |
| 54 | 55 years and over | 1530.0 | 1341.0 | 133.0 | 56.0 | 368.0 | 52.0 | 296.0 | 21.0 | 64.0 | 17.0 |
| 56 | Asian | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 57 | Men, 16 years and over | 4661.0 | 4307.0 | 251.0 | 104.0 | 527.0 | 77.0 | 425.0 | 24.0 | 140.0 | 19.0 |
| 58 | 16 to 19 years | 22.0 | 20.0 | 2.0 | 0.0 | 63.0 | 3.0 | 57.0 | 3.0 | 3.0 | 4.0 |
| 59 | 20 years and over | 4639.0 | 4287.0 | 249.0 | 103.0 | 464.0 | 74.0 | 368.0 | 22.0 | 137.0 | 15.0 |
| 60 | 20 to 24 years | 209.0 | 190.0 | 15.0 | 4.0 | 121.0 | 14.0 | 104.0 | 3.0 | 26.0 | 7.0 |
| 61 | 25 years and over | 4430.0 | 4097.0 | 234.0 | 99.0 | 343.0 | 60.0 | 264.0 | 19.0 | 112.0 | 8.0 |
| 62 | 25 to 54 years | 3575.0 | 3321.0 | 184.0 | 70.0 | 211.0 | 40.0 | 161.0 | 10.0 | 87.0 | 5.0 |
| 63 | 55 years and over | 855.0 | 776.0 | 50.0 | 29.0 | 132.0 | 20.0 | 103.0 | 8.0 | 24.0 | 3.0 |
| 64 | Women, 16 years and over | 3733.0 | 3356.0 | 250.0 | 127.0 | 911.0 | 95.0 | 754.0 | 63.0 | 117.0 | 29.0 |
| 65 | 16 to 19 years | 19.0 | 17.0 | 2.0 | 0.0 | 62.0 | 3.0 | 55.0 | 4.0 | 3.0 | 7.0 |
| 66 | 20 years and over | 3714.0 | 3340.0 | 248.0 | 127.0 | 849.0 | 92.0 | 698.0 | 59.0 | 114.0 | 22.0 |
| 67 | 20 to 24 years | 196.0 | 181.0 | 12.0 | 4.0 | 134.0 | 14.0 | 114.0 | 7.0 | 22.0 | 5.0 |
| 68 | 25 years and over | 3518.0 | 3159.0 | 236.0 | 123.0 | 715.0 | 78.0 | 585.0 | 53.0 | 93.0 | 17.0 |
| 69 | 25 to 54 years | 2801.0 | 2524.0 | 179.0 | 98.0 | 518.0 | 61.0 | 422.0 | 35.0 | 75.0 | 12.0 |
| 70 | 55 years and over | 717.0 | 635.0 | 57.0 | 25.0 | 197.0 | 17.0 | 163.0 | 17.0 | 18.0 | 5.0 |
| 72 | Hispanic or Latino ethnicity | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 73 | Men, 16 years and over | 13719.0 | 12502.0 | 916.0 | 301.0 | 1699.0 | 438.0 | 1207.0 | 54.0 | 601.0 | 94.0 |
| 74 | 16 to 19 years | 230.0 | 201.0 | 25.0 | 4.0 | 315.0 | 29.0 | 278.0 | 8.0 | 57.0 | 47.0 |
| 75 | 20 years and over | 13489.0 | 12301.0 | 891.0 | 298.0 | 1384.0 | 408.0 | 930.0 | 46.0 | 544.0 | 47.0 |
| 76 | 20 to 24 years | 1276.0 | 1170.0 | 86.0 | 20.0 | 429.0 | 87.0 | 333.0 | 9.0 | 120.0 | 16.0 |
| 77 | 25 years and over | 12213.0 | 11131.0 | 805.0 | 277.0 | 955.0 | 321.0 | 597.0 | 37.0 | 424.0 | 31.0 |
| 78 | 25 to 54 years | 10306.0 | 9410.0 | 676.0 | 221.0 | 712.0 | 268.0 | 420.0 | 24.0 | 336.0 | 20.0 |
| 79 | 55 years and over | 1907.0 | 1721.0 | 129.0 | 57.0 | 243.0 | 53.0 | 177.0 | 13.0 | 88.0 | 11.0 |
| 80 | Women, 16 years and over | 8694.0 | 7707.0 | 691.0 | 296.0 | 2901.0 | 508.0 | 2270.0 | 123.0 | 465.0 | 163.0 |
| 81 | 16 to 19 years | 140.0 | 120.0 | 17.0 | 2.0 | 409.0 | 33.0 | 361.0 | 15.0 | 33.0 | 49.0 |
| 82 | 20 years and over | 8554.0 | 7587.0 | 673.0 | 293.0 | 2491.0 | 475.0 | 1909.0 | 107.0 | 432.0 | 115.0 |
| 83 | 20 to 24 years | 890.0 | 784.0 | 78.0 | 28.0 | 566.0 | 96.0 | 454.0 | 16.0 | 76.0 | 27.0 |
| 84 | 25 years and over | 7664.0 | 6803.0 | 595.0 | 265.0 | 1926.0 | 379.0 | 1455.0 | 91.0 | 356.0 | 88.0 |
| 85 | 25 to 54 years | 6468.0 | 5752.0 | 497.0 | 219.0 | 1512.0 | 299.0 | 1143.0 | 70.0 | 299.0 | 71.0 |
| 86 | 55 years and over | 1196.0 | 1051.0 | 99.0 | 47.0 | 414.0 | 80.0 | 312.0 | 22.0 | 57.0 | 17.0 |
Save to Mongo¶
# Export the DataFrame to CSV
output_csv_path = "bls_pd_data.csv" # Local file path
bls_pd.to_csv(output_csv_path, index=False)
print(f"Data exported to {output_csv_path}")
Data exported to bls_pd_data.csv
import os
# Print the current working directory
print("File exported to:", os.getcwd())
File exported to: /home/jovyan/Walker - Labs/Project
See Part II for BLS Data to Mongo¶
Dataset #3: National Center for Education Statistics¶
Problem Statement¶
The dataset from the National Center for Education Statistics (NCES) contains state-level educational statistics. However, the dataset requires significant preprocessing to address missing values, standardize formats, and optimize storage for efficient querying and analysis. Without these preprocessing steps, the dataset risks becoming error-prone and difficult to analyze, limiting its usability for decision-making and educational policy planning.
Data Processing¶
Handle Missing Values:
- Identify columns with missing values and impute them with default values such as:
- Zero for numeric fields.
- "Unknown" for categorical fields.
- Ensure completeness for accurate analytics and computations.
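These imputation rules can be sketched in pandas; the column names and values below are hypothetical placeholders, not the project's actual schema:

```python
import pandas as pd

# Hypothetical sample with gaps in one numeric and one categorical column
df = pd.DataFrame({
    "median_income": [63600.0, None, 49900.0],
    "state": ["United States", None, "Alabama"],
})

# Zero for numeric fields, "Unknown" for categorical fields
numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns
df[numeric_cols] = df[numeric_cols].fillna(0)
df[categorical_cols] = df[categorical_cols].fillna("Unknown")

print(df.isna().sum().sum())  # 0 -- no missing values remain
```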
Standardize Column Names:
- Sanitize column names by:
- Removing spaces and special characters (e.g., replacing % with _percent, $ with _dollar).
- Converting to snake_case for consistency and ease of querying.
- Validate column names for compliance with Cassandra's schema requirements.
Validate Data Consistency:
- Group data by state and validate counts for each category (e.g., school levels, demographics).
- Ensure data integrity across rows and columns:
- Numeric consistency for statistical columns.
- Uniform categorizations for demographic groups.
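A minimal sketch of the group-by validation step, using hypothetical state and school_level values:

```python
import pandas as pd

# Hypothetical rows; the real dataset groups many more categories per state
df = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Alaska"],
    "school_level": ["Elementary", "Secondary", "Elementary"],
})

# Count distinct categories per state to spot gaps or duplicate encodings
counts = df.groupby("state")["school_level"].nunique()
print(counts.to_dict())  # {'Alabama': 2, 'Alaska': 1}
```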
Prepare Data for Querying¶
Clean numeric columns by:
- Removing invalid or outlier values.
- Converting strings to numeric types where necessary.
Create a flattened, structured dataset optimized for database insertion:
- Combine relevant metadata into a single table for easier querying.
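The numeric cleanup might look like the following sketch; the raw values (thousands separators, footnote dashes) are illustrative assumptions about the kind of noise present:

```python
import pandas as pd

# Illustrative raw column mixing clean numbers, thousands separators,
# footnote dashes, and missing entries
raw = pd.Series(["63600", "49,900", "---", None])

# Strip separators, coerce anything non-numeric to NaN, then drop invalid rows
clean = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce").dropna()
print(clean.tolist())  # [63600.0, 49900.0]
```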
Data Storage¶
Cassandra is selected as the preferred storage solution due to its scalability and suitability for write-heavy workloads in distributed systems. The storage will be optimized for educational datasets.
Advantages of Cassandra
**High Scalability:** Cassandra can handle large datasets, ideal for managing state-wise and nationwide educational statistics.
**Distributed Data Model:** The architecture enables quick access and querying, even for distributed analytics across multiple regions.
**Efficient Querying:** Schema design with partition keys like state or school_level ensures fast data retrieval.
**Write-Optimized:** Cassandra's write-heavy optimization makes it suitable for continuously ingesting new datasets.
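As a sketch of what such a schema could look like, the snippet below composes a CQL `CREATE TABLE` statement partitioned by `state`; the keyspace, table, and column names are illustrative, not the project's final design:

```python
# Illustrative keyspace/table/column names; not the project's final schema.
columns = {
    "state": "text",
    "school_level": "text",
    "median_income": "double",
}

# Partitioning on state keeps each state's rows together for fast lookups,
# with school_level as a clustering column for ordered retrieval within a state.
create_stmt = (
    "CREATE TABLE IF NOT EXISTS education.state_income ("
    + ", ".join(f"{name} {ctype}" for name, ctype in columns.items())
    + ", PRIMARY KEY ((state), school_level))"
)

# Against a live cluster this would run as:
# from cassandra.cluster import Cluster
# session = Cluster(["cassandra"]).connect()
# session.execute(create_stmt)
print(create_stmt)
```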
# MinIO Configuration
s3_url = "http://minio:9000"
s3_key = "minio"
s3_secret = "SU2orange!"
s3_bucket = "project"
s3_file_key = "tabn102.30.xlsx"
# Initialize MinIO client
s3 = boto3.client('s3',
endpoint_url=s3_url,
aws_access_key_id=s3_key,
aws_secret_access_key=s3_secret)
# Download Excel file locally from MinIO
local_file = "/home/jovyan/datasets/tabn102.30.xlsx"
s3.download_file(s3_bucket, s3_file_key, local_file)
# If the download fails, retry with a different local path:
#local_file = "/tmp/tabn102.30.xlsx"
Note: To prevent Jupyter Notebook from throwing errors related to the Spark context:
- Restart the kernel.
- Re-run the necessary library installations and imports.
- Execute the required user-defined functions in order.
For the current dataset, Dataset #3, I have commented out the initialization of Spark with MinIO so that the Spark context is configured for Cassandra. This prevents conflicts and avoids potential errors.
pip install boto3
pip install openpyxl
pip install -q cassandra-driver
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import os
import subprocess
from urllib.request import Request, urlopen, urlretrieve
from urllib.error import HTTPError
# install if needed: pip install boto3 , openpyxl
import boto3
from openpyxl import load_workbook
from pyspark.sql.functions import lit
import pandas as pd
from IPython.display import display
import re
from pyspark.sql.functions import col
import json
from cassandra.cluster import Cluster
# 1. Function to sanitize a single column name
def sanitize_column(name):
"""
Sanitizes a column name by replacing special characters and standardizing the format.
- Replaces '%' with '_percent' and '$' with '_dollar'.
- Converts spaces and special characters into underscores.
- Strips leading/trailing underscores and converts to lowercase.
"""
# Replace '%' and '$' with meaningful labels
name = name.replace('%', '_percent')
name = name.replace('$', '_dollar')
# Replace special characters with underscores
name = re.sub(r'[^\w\s]', '_', name) # Retain alphanumeric and spaces
# Replace spaces with underscores and collapse multiple underscores
name = re.sub(r'\s+', '_', name)
name = re.sub(r'_+', '_', name)
# Remove leading/trailing underscores and convert to lowercase
return name.strip('_').lower()
# 2. Function to apply sanitization to all columns in a DataFrame
def sanitize_dataframe_columns(dataframe):
"""
Sanitizes all column names in a Pandas DataFrame.
"""
sanitized_columns = [sanitize_column(col) for col in dataframe.columns]
dataframe.columns = sanitized_columns
return dataframe
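To illustrate what sanitize_column produces, here is a standalone demo; the function body is repeated so the snippet runs on its own, and the sample headers are hypothetical:

```python
import re

def sanitize_column(name):
    # Same logic as sanitize_column above, repeated so this demo is standalone
    name = name.replace('%', '_percent').replace('$', '_dollar')
    name = re.sub(r'[^\w\s]', '_', name)
    name = re.sub(r'\s+', '_', name)
    name = re.sub(r'_+', '_', name)
    return name.strip('_').lower()

# Hypothetical raw headers and their sanitized, Cassandra-safe forms
print(sanitize_column('% Children in Poverty'))  # percent_children_in_poverty
print(sanitize_column('Median income ($)'))      # median_income_dollar
```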
'''
# MinIO Configuration
s3_url = "http://minio:9000"
s3_key = "minio"
s3_secret = "SU2orange!"
s3_bucket = "project"
s3_file_key = "tabn102.30.xlsx"
# Initialize MinIO client
s3 = boto3.client('s3',
endpoint_url=s3_url,
aws_access_key_id=s3_key,
aws_secret_access_key=s3_secret)
# Download Excel file locally from MinIO
'''
local_file = "/home/jovyan/datasets/tabn102.30.xlsx"
#s3.download_file(s3_bucket, s3_file_key, local_file)
# Read the Excel file using pandas
# Specify the sheet name and skip rows if needed
nces_pd = pd.read_excel(local_file, sheet_name="Digest 2022 Table 102.30", skiprows=3)
# Adjust Pandas display options to show all rows
pd.set_option('display.max_rows', None) # Show all rows
pd.set_option('display.max_columns', None) # Show all columns
pd.set_option('display.width', None) # Expand display width to avoid line wrapping
# Display the DataFrame
display(nces_pd)
# Optionally, reset Pandas options to defaults after displaying
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
pd.reset_option('display.width')
| | 1 | 2 | 3 | 4 | Unnamed: 4 | 5 | Unnamed: 6 | 6 | Unnamed: 8 | 7 | Unnamed: 10 | 8 | Unnamed: 12 | 9 | Unnamed: 14 | 10 | Unnamed: 16 | 11 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 | Unnamed: 21 | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 | Unnamed: 25 | Unnamed: 26 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | United States | 63600.0 | 68500.0 | 64300.0 | 90.0 | 62300.0 | 50.0 | 63800.0 | 60.0 | 65100.0 | 80.0 | 66700.0 | 60.0 | 66800.0 | 60.0 | 69600.0 | 80.0 | 69700.0 | 80.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Alabama | 49900.0 | 55700.0 | 51300.0 | 450.0 | 50400.0 | 360.0 | 51200.0 | 520.0 | 52200.0 | 460.0 | 53200.0 | 520.0 | 53800.0 | 510.0 | 54800.0 | 390.0 | 53900.0 | 470.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Alaska | 87600.0 | 84100.0 | 78200.0 | 1530.0 | 80400.0 | 1580.0 | 83900.0 | 1610.0 | 86300.0 | 1530.0 | 80900.0 | 1770.0 | 80200.0 | 1500.0 | 80000.0 | 1740.0 | 77800.0 | 1310.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | Arizona | 58300.0 | 66200.0 | 61600.0 | 550.0 | 58300.0 | 400.0 | 58900.0 | 300.0 | 60500.0 | 440.0 | 62500.0 | 350.0 | 63900.0 | 480.0 | 65800.0 | 290.0 | 69100.0 | 420.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Arkansas | 44700.0 | 52500.0 | 48700.0 | 510.0 | 47700.0 | 490.0 | 48000.0 | 330.0 | 50100.0 | 630.0 | 50700.0 | 510.0 | 50800.0 | 470.0 | 51900.0 | 560.0 | 52500.0 | 610.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | California | 75700.0 | 77500.0 | 74600.0 | 270.0 | 71900.0 | 270.0 | 73800.0 | 270.0 | 76500.0 | 240.0 | 79400.0 | 200.0 | 81200.0 | 210.0 | 85200.0 | 200.0 | 84900.0 | 330.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | Colorado | 63800.0 | 77000.0 | 70400.0 | 470.0 | 67300.0 | 550.0 | 73100.0 | 620.0 | 74200.0 | 440.0 | 76400.0 | 520.0 | 77600.0 | 430.0 | 81700.0 | 510.0 | 82300.0 | 480.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | Connecticut | 88300.0 | 88000.0 | 84800.0 | 690.0 | 79700.0 | 830.0 | 81600.0 | 540.0 | 82900.0 | 730.0 | 82000.0 | 900.0 | 82400.0 | 600.0 | 83500.0 | 870.0 | 83800.0 | 680.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | Delaware | 73800.0 | 77300.0 | 73000.0 | 1200.0 | 69500.0 | 1130.0 | 70100.0 | 890.0 | 69700.0 | 1020.0 | 69500.0 | 1660.0 | 69900.0 | 1030.0 | 74400.0 | 1050.0 | 71100.0 | 1090.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | District of Columbia | 65000.0 | 65500.0 | 65700.0 | 1640.0 | 75800.0 | 1170.0 | 86500.0 | 1730.0 | 85200.0 | 2340.0 | 91000.0 | 1780.0 | 91900.0 | 2240.0 | 97800.0 | 1610.0 | 90100.0 | 2110.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10 | Florida | 58100.0 | 63300.0 | 59000.0 | 230.0 | 55300.0 | 250.0 | 56500.0 | 230.0 | 57400.0 | 170.0 | 58100.0 | 240.0 | 59800.0 | 250.0 | 62800.0 | 290.0 | 63100.0 | 260.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 11 | Georgia | 61400.0 | 69200.0 | 63400.0 | 370.0 | 57800.0 | 360.0 | 58600.0 | 260.0 | 60500.0 | 490.0 | 62100.0 | 380.0 | 63400.0 | 470.0 | 65700.0 | 280.0 | 66600.0 | 430.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 12 | Hawaii | 82100.0 | 81300.0 | 80800.0 | 1660.0 | 78500.0 | 1180.0 | 84000.0 | 1400.0 | 84100.0 | 1220.0 | 86000.0 | 1210.0 | 86500.0 | 1320.0 | 88100.0 | 1380.0 | 84900.0 | 1280.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 13 | Idaho | 53400.0 | 61300.0 | 57600.0 | 710.0 | 54200.0 | 760.0 | 55200.0 | 760.0 | 58500.0 | 660.0 | 57700.0 | 590.0 | 60000.0 | 600.0 | 64600.0 | 690.0 | 66500.0 | 1070.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 14 | Illinois | 68200.0 | 76000.0 | 69900.0 | 290.0 | 66000.0 | 330.0 | 68100.0 | 360.0 | 68800.0 | 270.0 | 69600.0 | 370.0 | 70200.0 | 330.0 | 73300.0 | 380.0 | 72200.0 | 350.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 15 | Indiana | 60900.0 | 67800.0 | 61200.0 | 430.0 | 55600.0 | 340.0 | 57800.0 | 270.0 | 59100.0 | 250.0 | 59900.0 | 440.0 | 60100.0 | 340.0 | 61000.0 | 440.0 | 62700.0 | 440.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 16 | Iowa | 55500.0 | 64400.0 | 60600.0 | 440.0 | 59700.0 | 490.0 | 62600.0 | 550.0 | 63500.0 | 480.0 | 64700.0 | 570.0 | 64700.0 | 580.0 | 65400.0 | 480.0 | 65600.0 | 470.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 17 | Kansas | 57700.0 | 66300.0 | 59700.0 | 620.0 | 60100.0 | 660.0 | 61600.0 | 610.0 | 62000.0 | 610.0 | 62400.0 | 470.0 | 62800.0 | 510.0 | 65800.0 | 450.0 | 64100.0 | 580.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 18 | Kentucky | 47700.0 | 54900.0 | 52000.0 | 400.0 | 49900.0 | 380.0 | 51700.0 | 330.0 | 52700.0 | 410.0 | 53500.0 | 420.0 | 54200.0 | 370.0 | 55400.0 | 400.0 | 55600.0 | 430.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 19 | Louisiana | 46400.0 | 53100.0 | 51100.0 | 490.0 | 52900.0 | 540.0 | 52300.0 | 470.0 | 51000.0 | 530.0 | 51000.0 | 430.0 | 51700.0 | 490.0 | 54100.0 | 380.0 | 52100.0 | 460.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 20 | Maine | 58900.0 | 60700.0 | 59500.0 | 820.0 | 57100.0 | 720.0 | 58900.0 | 590.0 | 59900.0 | 950.0 | 62200.0 | 780.0 | 60000.0 | 870.0 | 62400.0 | 1130.0 | 64800.0 | 760.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 21 | Maryland | 83300.0 | 86200.0 | 85700.0 | 500.0 | 85700.0 | 700.0 | 86700.0 | 410.0 | 89100.0 | 510.0 | 89300.0 | 480.0 | 89800.0 | 670.0 | 91900.0 | 600.0 | 90200.0 | 650.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 22 | Massachusetts | 78200.0 | 82400.0 | 79500.0 | 590.0 | 77300.0 | 310.0 | 80800.0 | 430.0 | 85000.0 | 530.0 | 85500.0 | 610.0 | 86100.0 | 640.0 | 91000.0 | 640.0 | 89600.0 | 780.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 23 | Michigan | 65600.0 | 72900.0 | 64000.0 | 380.0 | 56600.0 | 230.0 | 58400.0 | 170.0 | 59300.0 | 280.0 | 60700.0 | 300.0 | 61200.0 | 270.0 | 63100.0 | 370.0 | 63500.0 | 320.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24 | Minnesota | 65400.0 | 76800.0 | 72400.0 | 310.0 | 69100.0 | 350.0 | 72600.0 | 470.0 | 74100.0 | 420.0 | 75600.0 | 440.0 | 75900.0 | 350.0 | 79100.0 | 530.0 | 77700.0 | 560.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25 | Mississippi | 42600.0 | 51100.0 | 45800.0 | 520.0 | 45900.0 | 480.0 | 46400.0 | 380.0 | 47100.0 | 380.0 | 48100.0 | 570.0 | 48200.0 | 520.0 | 48500.0 | 760.0 | 48700.0 | 760.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 26 | Missouri | 55800.0 | 61900.0 | 58400.0 | 300.0 | 55200.0 | 380.0 | 57500.0 | 290.0 | 58400.0 | 260.0 | 59300.0 | 420.0 | 58800.0 | 490.0 | 60800.0 | 410.0 | 61800.0 | 410.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 27 | Montana | 48600.0 | 53900.0 | 54700.0 | 820.0 | 53100.0 | 860.0 | 56600.0 | 980.0 | 56500.0 | 750.0 | 59000.0 | 960.0 | 59700.0 | 770.0 | 60600.0 | 780.0 | 63200.0 | 870.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 28 | Nebraska | 55000.0 | 64000.0 | 61000.0 | 640.0 | 60300.0 | 680.0 | 62900.0 | 580.0 | 64300.0 | 530.0 | 66300.0 | 680.0 | 64300.0 | 700.0 | 67000.0 | 600.0 | 66800.0 | 640.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 29 | Nevada | 65600.0 | 72700.0 | 68400.0 | 750.0 | 63500.0 | 600.0 | 60000.0 | 680.0 | 62300.0 | 620.0 | 64100.0 | 640.0 | 63300.0 | 740.0 | 67100.0 | 660.0 | 66300.0 | 610.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 30 | New Hampshire | 76900.0 | 80700.0 | 78900.0 | 840.0 | 76000.0 | 870.0 | 80400.0 | 870.0 | 80100.0 | 980.0 | 81100.0 | 1140.0 | 80900.0 | 1020.0 | 82600.0 | 1280.0 | 88500.0 | 1300.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 31 | New Jersey | 86600.0 | 90000.0 | 85800.0 | 440.0 | 84300.0 | 580.0 | 82600.0 | 420.0 | 85900.0 | 480.0 | 88500.0 | 450.0 | 88200.0 | 450.0 | 90900.0 | 490.0 | 89300.0 | 660.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 32 | New Mexico | 51000.0 | 55700.0 | 52100.0 | 630.0 | 52400.0 | 560.0 | 51900.0 | 600.0 | 52800.0 | 570.0 | 51700.0 | 800.0 | 50900.0 | 590.0 | 55000.0 | 620.0 | 54000.0 | 900.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 33 | New York | 69700.0 | 70800.0 | 68800.0 | 360.0 | 67400.0 | 280.0 | 69600.0 | 210.0 | 71000.0 | 430.0 | 71700.0 | 340.0 | 73200.0 | 420.0 | 76400.0 | 290.0 | 74300.0 | 370.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 34 | North Carolina | 56400.0 | 63900.0 | 56600.0 | 270.0 | 54000.0 | 280.0 | 54700.0 | 360.0 | 57100.0 | 200.0 | 58300.0 | 340.0 | 58100.0 | 380.0 | 60800.0 | 390.0 | 62000.0 | 330.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 35 | North Dakota | 49100.0 | 56400.0 | 57100.0 | 600.0 | 60600.0 | 1190.0 | 69300.0 | 1150.0 | 68500.0 | 1050.0 | 68400.0 | 710.0 | 68900.0 | 1520.0 | 68400.0 | 1940.0 | 66500.0 | 1110.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 36 | Ohio | 60700.0 | 66800.0 | 60500.0 | 290.0 | 56200.0 | 220.0 | 58400.0 | 160.0 | 59100.0 | 190.0 | 59700.0 | 280.0 | 60500.0 | 280.0 | 62100.0 | 360.0 | 62300.0 | 280.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 37 | Oklahoma | 49900.0 | 54500.0 | 51500.0 | 480.0 | 52400.0 | 310.0 | 55500.0 | 360.0 | 55500.0 | 430.0 | 55300.0 | 390.0 | 56000.0 | 300.0 | 57700.0 | 390.0 | 55800.0 | 380.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 38 | Oregon | 57600.0 | 66700.0 | 59700.0 | 490.0 | 58000.0 | 410.0 | 61900.0 | 630.0 | 65000.0 | 590.0 | 66600.0 | 530.0 | 68400.0 | 690.0 | 71100.0 | 560.0 | 71600.0 | 640.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 39 | Pennsylvania | 61500.0 | 65400.0 | 61900.0 | 330.0 | 61400.0 | 330.0 | 63700.0 | 260.0 | 64200.0 | 250.0 | 65400.0 | 300.0 | 65700.0 | 210.0 | 67300.0 | 280.0 | 69000.0 | 310.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 40 | Rhode Island | 68100.0 | 68700.0 | 71600.0 | 1160.0 | 65100.0 | 940.0 | 66400.0 | 1680.0 | 68400.0 | 1090.0 | 70600.0 | 1250.0 | 69400.0 | 1350.0 | 75400.0 | 1000.0 | 74000.0 | 1450.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 41 | South Carolina | 55500.0 | 60500.0 | 54700.0 | 520.0 | 52300.0 | 320.0 | 54000.0 | 380.0 | 55900.0 | 410.0 | 55900.0 | 360.0 | 56400.0 | 410.0 | 59600.0 | 540.0 | 59300.0 | 520.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 42 | South Dakota | 47600.0 | 57600.0 | 56100.0 | 750.0 | 57200.0 | 920.0 | 60600.0 | 760.0 | 61500.0 | 880.0 | 62500.0 | 880.0 | 60700.0 | 950.0 | 63100.0 | 1260.0 | 66100.0 | 1370.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 43 | Tennessee | 52500.0 | 59300.0 | 54100.0 | 410.0 | 51600.0 | 310.0 | 54100.0 | 340.0 | 54800.0 | 460.0 | 56700.0 | 280.0 | 56500.0 | 320.0 | 59400.0 | 420.0 | 59700.0 | 480.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 44 | Texas | 57200.0 | 65100.0 | 58600.0 | 210.0 | 60500.0 | 260.0 | 63600.0 | 220.0 | 63900.0 | 210.0 | 65400.0 | 330.0 | 65400.0 | 180.0 | 67900.0 | 320.0 | 67000.0 | 310.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 45 | Utah | 62300.0 | 74600.0 | 66700.0 | 800.0 | 68200.0 | 470.0 | 71900.0 | 850.0 | 74500.0 | 660.0 | 75600.0 | 830.0 | 77100.0 | 490.0 | 80300.0 | 700.0 | 79400.0 | 760.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 46 | Vermont | 63000.0 | 66600.0 | 63500.0 | 1010.0 | 61500.0 | 1120.0 | 65200.0 | 920.0 | 65100.0 | 1150.0 | 63600.0 | 1400.0 | 65600.0 | 1020.0 | 66800.0 | 1130.0 | 72400.0 | 1350.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 47 | Virginia | 70500.0 | 76100.0 | 75400.0 | 460.0 | 75600.0 | 350.0 | 75800.0 | 380.0 | 76900.0 | 510.0 | 79100.0 | 370.0 | 78300.0 | 450.0 | 81000.0 | 540.0 | 81000.0 | 440.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 48 | Washington | 66000.0 | 74700.0 | 68500.0 | 540.0 | 69300.0 | 420.0 | 73300.0 | 550.0 | 75800.0 | 410.0 | 78500.0 | 390.0 | 79900.0 | 460.0 | 83400.0 | 590.0 | 84200.0 | 510.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 49 | West Virginia | 44000.0 | 48400.0 | 46500.0 | 680.0 | 47600.0 | 680.0 | 48100.0 | 550.0 | 49000.0 | 760.0 | 48000.0 | 940.0 | 47600.0 | 670.0 | 51800.0 | 730.0 | 51200.0 | 660.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50 | Wisconsin | 62300.0 | 71400.0 | 65500.0 | 330.0 | 61000.0 | 380.0 | 63600.0 | 300.0 | 64100.0 | 380.0 | 65500.0 | 390.0 | 65600.0 | 260.0 | 68000.0 | 390.0 | 67100.0 | 370.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 51 | Wyoming | 57300.0 | 61800.0 | 64300.0 | 1280.0 | 66600.0 | 1430.0 | 68900.0 | 1100.0 | 67600.0 | 1520.0 | 66800.0 | 1180.0 | 66400.0 | 920.0 | 68900.0 | 1360.0 | 65200.0 | 1670.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 52 | \1\Based on 1989 incomes collected in the 1990... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 53 | \2\Based on 1999 incomes collected in the 2000... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 54 | NOTE: Constant dollars adjusted by the Consume... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 55 | SOURCE: U.S. Department of Commerce, Census Bu... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
PART II: Dataset Preprocessing, Transformation, and Storage¶
This part covers cleaning the data and handling missing values. We'll ensure data quality and integrity are maintained for reliable analysis.
Description: This code performs several essential data exploration and preprocessing tasks on the DataFrames.
Purpose: To gather comprehensive information about the DataFrame's structure, statistical summary, and data quality, which is crucial for effective data analysis and preprocessing.
Handling Missing Data¶
# Define the indices of rows to remove
rows_to_remove = [52, 53, 54, 55]
# Drop the rows by index
nces_pd = nces_pd.drop(rows_to_remove, axis=0)
# Reset the index (optional, if you want sequential indices)
nces_pd.reset_index(drop=True, inplace=True)
# Drop every column that still contains NaN values
nces_pd = nces_pd.dropna(axis=1)
display(nces_pd)
| | 1 | 2 | 3 | 4 | Unnamed: 4 | 5 | Unnamed: 6 | 6 | Unnamed: 8 | 7 | Unnamed: 10 | 8 | Unnamed: 12 | 9 | Unnamed: 14 | 10 | Unnamed: 16 | 11 | Unnamed: 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | United States | 63600.0 | 68500.0 | 64300.0 | 90.0 | 62300.0 | 50.0 | 63800.0 | 60.0 | 65100.0 | 80.0 | 66700.0 | 60.0 | 66800.0 | 60.0 | 69600.0 | 80.0 | 69700.0 | 80.0 |
| 1 | Alabama | 49900.0 | 55700.0 | 51300.0 | 450.0 | 50400.0 | 360.0 | 51200.0 | 520.0 | 52200.0 | 460.0 | 53200.0 | 520.0 | 53800.0 | 510.0 | 54800.0 | 390.0 | 53900.0 | 470.0 |
| 2 | Alaska | 87600.0 | 84100.0 | 78200.0 | 1530.0 | 80400.0 | 1580.0 | 83900.0 | 1610.0 | 86300.0 | 1530.0 | 80900.0 | 1770.0 | 80200.0 | 1500.0 | 80000.0 | 1740.0 | 77800.0 | 1310.0 |
| 3 | Arizona | 58300.0 | 66200.0 | 61600.0 | 550.0 | 58300.0 | 400.0 | 58900.0 | 300.0 | 60500.0 | 440.0 | 62500.0 | 350.0 | 63900.0 | 480.0 | 65800.0 | 290.0 | 69100.0 | 420.0 |
| 4 | Arkansas | 44700.0 | 52500.0 | 48700.0 | 510.0 | 47700.0 | 490.0 | 48000.0 | 330.0 | 50100.0 | 630.0 | 50700.0 | 510.0 | 50800.0 | 470.0 | 51900.0 | 560.0 | 52500.0 | 610.0 |
| 5 | California | 75700.0 | 77500.0 | 74600.0 | 270.0 | 71900.0 | 270.0 | 73800.0 | 270.0 | 76500.0 | 240.0 | 79400.0 | 200.0 | 81200.0 | 210.0 | 85200.0 | 200.0 | 84900.0 | 330.0 |
| 6 | Colorado | 63800.0 | 77000.0 | 70400.0 | 470.0 | 67300.0 | 550.0 | 73100.0 | 620.0 | 74200.0 | 440.0 | 76400.0 | 520.0 | 77600.0 | 430.0 | 81700.0 | 510.0 | 82300.0 | 480.0 |
| 7 | Connecticut | 88300.0 | 88000.0 | 84800.0 | 690.0 | 79700.0 | 830.0 | 81600.0 | 540.0 | 82900.0 | 730.0 | 82000.0 | 900.0 | 82400.0 | 600.0 | 83500.0 | 870.0 | 83800.0 | 680.0 |
| 8 | Delaware | 73800.0 | 77300.0 | 73000.0 | 1200.0 | 69500.0 | 1130.0 | 70100.0 | 890.0 | 69700.0 | 1020.0 | 69500.0 | 1660.0 | 69900.0 | 1030.0 | 74400.0 | 1050.0 | 71100.0 | 1090.0 |
| 9 | District of Columbia | 65000.0 | 65500.0 | 65700.0 | 1640.0 | 75800.0 | 1170.0 | 86500.0 | 1730.0 | 85200.0 | 2340.0 | 91000.0 | 1780.0 | 91900.0 | 2240.0 | 97800.0 | 1610.0 | 90100.0 | 2110.0 |
| 10 | Florida | 58100.0 | 63300.0 | 59000.0 | 230.0 | 55300.0 | 250.0 | 56500.0 | 230.0 | 57400.0 | 170.0 | 58100.0 | 240.0 | 59800.0 | 250.0 | 62800.0 | 290.0 | 63100.0 | 260.0 |
| 11 | Georgia | 61400.0 | 69200.0 | 63400.0 | 370.0 | 57800.0 | 360.0 | 58600.0 | 260.0 | 60500.0 | 490.0 | 62100.0 | 380.0 | 63400.0 | 470.0 | 65700.0 | 280.0 | 66600.0 | 430.0 |
| 12 | Hawaii | 82100.0 | 81300.0 | 80800.0 | 1660.0 | 78500.0 | 1180.0 | 84000.0 | 1400.0 | 84100.0 | 1220.0 | 86000.0 | 1210.0 | 86500.0 | 1320.0 | 88100.0 | 1380.0 | 84900.0 | 1280.0 |
| 13 | Idaho | 53400.0 | 61300.0 | 57600.0 | 710.0 | 54200.0 | 760.0 | 55200.0 | 760.0 | 58500.0 | 660.0 | 57700.0 | 590.0 | 60000.0 | 600.0 | 64600.0 | 690.0 | 66500.0 | 1070.0 |
| 14 | Illinois | 68200.0 | 76000.0 | 69900.0 | 290.0 | 66000.0 | 330.0 | 68100.0 | 360.0 | 68800.0 | 270.0 | 69600.0 | 370.0 | 70200.0 | 330.0 | 73300.0 | 380.0 | 72200.0 | 350.0 |
| 15 | Indiana | 60900.0 | 67800.0 | 61200.0 | 430.0 | 55600.0 | 340.0 | 57800.0 | 270.0 | 59100.0 | 250.0 | 59900.0 | 440.0 | 60100.0 | 340.0 | 61000.0 | 440.0 | 62700.0 | 440.0 |
| 16 | Iowa | 55500.0 | 64400.0 | 60600.0 | 440.0 | 59700.0 | 490.0 | 62600.0 | 550.0 | 63500.0 | 480.0 | 64700.0 | 570.0 | 64700.0 | 580.0 | 65400.0 | 480.0 | 65600.0 | 470.0 |
| 17 | Kansas | 57700.0 | 66300.0 | 59700.0 | 620.0 | 60100.0 | 660.0 | 61600.0 | 610.0 | 62000.0 | 610.0 | 62400.0 | 470.0 | 62800.0 | 510.0 | 65800.0 | 450.0 | 64100.0 | 580.0 |
| 18 | Kentucky | 47700.0 | 54900.0 | 52000.0 | 400.0 | 49900.0 | 380.0 | 51700.0 | 330.0 | 52700.0 | 410.0 | 53500.0 | 420.0 | 54200.0 | 370.0 | 55400.0 | 400.0 | 55600.0 | 430.0 |
| 19 | Louisiana | 46400.0 | 53100.0 | 51100.0 | 490.0 | 52900.0 | 540.0 | 52300.0 | 470.0 | 51000.0 | 530.0 | 51000.0 | 430.0 | 51700.0 | 490.0 | 54100.0 | 380.0 | 52100.0 | 460.0 |
| 20 | Maine | 58900.0 | 60700.0 | 59500.0 | 820.0 | 57100.0 | 720.0 | 58900.0 | 590.0 | 59900.0 | 950.0 | 62200.0 | 780.0 | 60000.0 | 870.0 | 62400.0 | 1130.0 | 64800.0 | 760.0 |
| 21 | Maryland | 83300.0 | 86200.0 | 85700.0 | 500.0 | 85700.0 | 700.0 | 86700.0 | 410.0 | 89100.0 | 510.0 | 89300.0 | 480.0 | 89800.0 | 670.0 | 91900.0 | 600.0 | 90200.0 | 650.0 |
| 22 | Massachusetts | 78200.0 | 82400.0 | 79500.0 | 590.0 | 77300.0 | 310.0 | 80800.0 | 430.0 | 85000.0 | 530.0 | 85500.0 | 610.0 | 86100.0 | 640.0 | 91000.0 | 640.0 | 89600.0 | 780.0 |
| 23 | Michigan | 65600.0 | 72900.0 | 64000.0 | 380.0 | 56600.0 | 230.0 | 58400.0 | 170.0 | 59300.0 | 280.0 | 60700.0 | 300.0 | 61200.0 | 270.0 | 63100.0 | 370.0 | 63500.0 | 320.0 |
| 24 | Minnesota | 65400.0 | 76800.0 | 72400.0 | 310.0 | 69100.0 | 350.0 | 72600.0 | 470.0 | 74100.0 | 420.0 | 75600.0 | 440.0 | 75900.0 | 350.0 | 79100.0 | 530.0 | 77700.0 | 560.0 |
| 25 | Mississippi | 42600.0 | 51100.0 | 45800.0 | 520.0 | 45900.0 | 480.0 | 46400.0 | 380.0 | 47100.0 | 380.0 | 48100.0 | 570.0 | 48200.0 | 520.0 | 48500.0 | 760.0 | 48700.0 | 760.0 |
| 26 | Missouri | 55800.0 | 61900.0 | 58400.0 | 300.0 | 55200.0 | 380.0 | 57500.0 | 290.0 | 58400.0 | 260.0 | 59300.0 | 420.0 | 58800.0 | 490.0 | 60800.0 | 410.0 | 61800.0 | 410.0 |
| 27 | Montana | 48600.0 | 53900.0 | 54700.0 | 820.0 | 53100.0 | 860.0 | 56600.0 | 980.0 | 56500.0 | 750.0 | 59000.0 | 960.0 | 59700.0 | 770.0 | 60600.0 | 780.0 | 63200.0 | 870.0 |
| 28 | Nebraska | 55000.0 | 64000.0 | 61000.0 | 640.0 | 60300.0 | 680.0 | 62900.0 | 580.0 | 64300.0 | 530.0 | 66300.0 | 680.0 | 64300.0 | 700.0 | 67000.0 | 600.0 | 66800.0 | 640.0 |
| 29 | Nevada | 65600.0 | 72700.0 | 68400.0 | 750.0 | 63500.0 | 600.0 | 60000.0 | 680.0 | 62300.0 | 620.0 | 64100.0 | 640.0 | 63300.0 | 740.0 | 67100.0 | 660.0 | 66300.0 | 610.0 |
| 30 | New Hampshire | 76900.0 | 80700.0 | 78900.0 | 840.0 | 76000.0 | 870.0 | 80400.0 | 870.0 | 80100.0 | 980.0 | 81100.0 | 1140.0 | 80900.0 | 1020.0 | 82600.0 | 1280.0 | 88500.0 | 1300.0 |
| 31 | New Jersey | 86600.0 | 90000.0 | 85800.0 | 440.0 | 84300.0 | 580.0 | 82600.0 | 420.0 | 85900.0 | 480.0 | 88500.0 | 450.0 | 88200.0 | 450.0 | 90900.0 | 490.0 | 89300.0 | 660.0 |
| 32 | New Mexico | 51000.0 | 55700.0 | 52100.0 | 630.0 | 52400.0 | 560.0 | 51900.0 | 600.0 | 52800.0 | 570.0 | 51700.0 | 800.0 | 50900.0 | 590.0 | 55000.0 | 620.0 | 54000.0 | 900.0 |
| 33 | New York | 69700.0 | 70800.0 | 68800.0 | 360.0 | 67400.0 | 280.0 | 69600.0 | 210.0 | 71000.0 | 430.0 | 71700.0 | 340.0 | 73200.0 | 420.0 | 76400.0 | 290.0 | 74300.0 | 370.0 |
| 34 | North Carolina | 56400.0 | 63900.0 | 56600.0 | 270.0 | 54000.0 | 280.0 | 54700.0 | 360.0 | 57100.0 | 200.0 | 58300.0 | 340.0 | 58100.0 | 380.0 | 60800.0 | 390.0 | 62000.0 | 330.0 |
| 35 | North Dakota | 49100.0 | 56400.0 | 57100.0 | 600.0 | 60600.0 | 1190.0 | 69300.0 | 1150.0 | 68500.0 | 1050.0 | 68400.0 | 710.0 | 68900.0 | 1520.0 | 68400.0 | 1940.0 | 66500.0 | 1110.0 |
| 36 | Ohio | 60700.0 | 66800.0 | 60500.0 | 290.0 | 56200.0 | 220.0 | 58400.0 | 160.0 | 59100.0 | 190.0 | 59700.0 | 280.0 | 60500.0 | 280.0 | 62100.0 | 360.0 | 62300.0 | 280.0 |
| 37 | Oklahoma | 49900.0 | 54500.0 | 51500.0 | 480.0 | 52400.0 | 310.0 | 55500.0 | 360.0 | 55500.0 | 430.0 | 55300.0 | 390.0 | 56000.0 | 300.0 | 57700.0 | 390.0 | 55800.0 | 380.0 |
| 38 | Oregon | 57600.0 | 66700.0 | 59700.0 | 490.0 | 58000.0 | 410.0 | 61900.0 | 630.0 | 65000.0 | 590.0 | 66600.0 | 530.0 | 68400.0 | 690.0 | 71100.0 | 560.0 | 71600.0 | 640.0 |
| 39 | Pennsylvania | 61500.0 | 65400.0 | 61900.0 | 330.0 | 61400.0 | 330.0 | 63700.0 | 260.0 | 64200.0 | 250.0 | 65400.0 | 300.0 | 65700.0 | 210.0 | 67300.0 | 280.0 | 69000.0 | 310.0 |
| 40 | Rhode Island | 68100.0 | 68700.0 | 71600.0 | 1160.0 | 65100.0 | 940.0 | 66400.0 | 1680.0 | 68400.0 | 1090.0 | 70600.0 | 1250.0 | 69400.0 | 1350.0 | 75400.0 | 1000.0 | 74000.0 | 1450.0 |
| 41 | South Carolina | 55500.0 | 60500.0 | 54700.0 | 520.0 | 52300.0 | 320.0 | 54000.0 | 380.0 | 55900.0 | 410.0 | 55900.0 | 360.0 | 56400.0 | 410.0 | 59600.0 | 540.0 | 59300.0 | 520.0 |
| 42 | South Dakota | 47600.0 | 57600.0 | 56100.0 | 750.0 | 57200.0 | 920.0 | 60600.0 | 760.0 | 61500.0 | 880.0 | 62500.0 | 880.0 | 60700.0 | 950.0 | 63100.0 | 1260.0 | 66100.0 | 1370.0 |
| 43 | Tennessee | 52500.0 | 59300.0 | 54100.0 | 410.0 | 51600.0 | 310.0 | 54100.0 | 340.0 | 54800.0 | 460.0 | 56700.0 | 280.0 | 56500.0 | 320.0 | 59400.0 | 420.0 | 59700.0 | 480.0 |
| 44 | Texas | 57200.0 | 65100.0 | 58600.0 | 210.0 | 60500.0 | 260.0 | 63600.0 | 220.0 | 63900.0 | 210.0 | 65400.0 | 330.0 | 65400.0 | 180.0 | 67900.0 | 320.0 | 67000.0 | 310.0 |
| 45 | Utah | 62300.0 | 74600.0 | 66700.0 | 800.0 | 68200.0 | 470.0 | 71900.0 | 850.0 | 74500.0 | 660.0 | 75600.0 | 830.0 | 77100.0 | 490.0 | 80300.0 | 700.0 | 79400.0 | 760.0 |
| 46 | Vermont | 63000.0 | 66600.0 | 63500.0 | 1010.0 | 61500.0 | 1120.0 | 65200.0 | 920.0 | 65100.0 | 1150.0 | 63600.0 | 1400.0 | 65600.0 | 1020.0 | 66800.0 | 1130.0 | 72400.0 | 1350.0 |
| 47 | Virginia | 70500.0 | 76100.0 | 75400.0 | 460.0 | 75600.0 | 350.0 | 75800.0 | 380.0 | 76900.0 | 510.0 | 79100.0 | 370.0 | 78300.0 | 450.0 | 81000.0 | 540.0 | 81000.0 | 440.0 |
| 48 | Washington | 66000.0 | 74700.0 | 68500.0 | 540.0 | 69300.0 | 420.0 | 73300.0 | 550.0 | 75800.0 | 410.0 | 78500.0 | 390.0 | 79900.0 | 460.0 | 83400.0 | 590.0 | 84200.0 | 510.0 |
| 49 | West Virginia | 44000.0 | 48400.0 | 46500.0 | 680.0 | 47600.0 | 680.0 | 48100.0 | 550.0 | 49000.0 | 760.0 | 48000.0 | 940.0 | 47600.0 | 670.0 | 51800.0 | 730.0 | 51200.0 | 660.0 |
| 50 | Wisconsin | 62300.0 | 71400.0 | 65500.0 | 330.0 | 61000.0 | 380.0 | 63600.0 | 300.0 | 64100.0 | 380.0 | 65500.0 | 390.0 | 65600.0 | 260.0 | 68000.0 | 390.0 | 67100.0 | 370.0 |
| 51 | Wyoming | 57300.0 | 61800.0 | 64300.0 | 1280.0 | 66600.0 | 1430.0 | 68900.0 | 1100.0 | 67600.0 | 1520.0 | 66800.0 | 1180.0 | 66400.0 | 920.0 | 68900.0 | 1360.0 | 65200.0 | 1670.0 |
print("Column names:", nces_pd.columns)
Column names: Index([ 1, '2', 3, 4,
'Unnamed: 4', 5, 'Unnamed: 6', 6,
'Unnamed: 8', 7, 'Unnamed: 10', 8,
'Unnamed: 12', 9, 'Unnamed: 14', 10,
'Unnamed: 16', 11, 'Unnamed: 18'],
dtype='object')
Standardizing Column Names for Consistent and Readable Data¶
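The renaming code below relies on a sanitize_column helper defined earlier in the notebook. For reference, a minimal stand-in (an assumption about its behavior, not the notebook's actual definition) could look like this:

```python
import re

def sanitize_column(name) -> str:
    """Lower-case a column label and replace runs of non-alphanumerics with underscores.
    Hypothetical stand-in for the notebook's sanitize_column helper."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", str(name).strip())
    return cleaned.strip("_").lower()

print(sanitize_column("Median Income 1990"))  # median_income_1990
print(sanitize_column("Unnamed: 4"))          # unnamed_4
```

Any helper with this behavior yields the snake_case names shown in the renamed table below, which matter later because Cassandra and Spark are happiest with lowercase, underscore-separated identifiers.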
# Rename columns in a Pandas DataFrame
# Mapping from raw column labels to descriptive names
column_mapping = {
1: "state_name",
"2":"Median Income 1990",
3:"Median Income 2000",
4:"Median Income 2005",
"Unnamed: 4": "Std Error 2005",
5:"Median Income 2010",
"Unnamed: 6": "Std Error 2010",
6:"Median Income 2015",
"Unnamed: 8": "Std Error 2015",
7:"Median Income 2016",
"Unnamed: 10": "Std Error 2016",
8:"Median Income 2017",
"Unnamed: 12": "Std Error 2017",
9:"Median Income 2018",
"Unnamed: 14": "Std Error 2018",
10:"Median Income 2019",
"Unnamed: 16": "Std Error 2019",
11:"Median Income 2021",
"Unnamed: 18": "Std Error 2021"
}
# Sanitize the target names in the mapping
sanitized_column_mapping = {key: sanitize_column(value) for key, value in column_mapping.items()}
# Preview the sanitized names on an empty DataFrame with the original column labels
original_columns = list(column_mapping.keys())
nces_df = pd.DataFrame(columns=original_columns)
# Rename and sanitize columns
nces_df = nces_df.rename(columns=sanitized_column_mapping)
# Display the new column names
#print("Sanitized Column Names:")
#print(list(nces_df.columns))
# Rename and sanitize columns in the original DataFrame (nces_pd)
nces_pd = nces_pd.rename(columns=sanitized_column_mapping)
# Adjust Pandas display options to show all rows
pd.set_option('display.max_rows', None) # Show all rows
pd.set_option('display.max_columns', None) # Show all columns
pd.set_option('display.width', None) # Expand display width to avoid line wrapping
# Display the DataFrame
display(nces_pd)
| state_name | median_income_1990 | median_income_2000 | median_income_2005 | std_error_2005 | median_income_2010 | std_error_2010 | median_income_2015 | std_error_2015 | median_income_2016 | std_error_2016 | median_income_2017 | std_error_2017 | median_income_2018 | std_error_2018 | median_income_2019 | std_error_2019 | median_income_2021 | std_error_2021 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | United States | 63600.0 | 68500.0 | 64300.0 | 90.0 | 62300.0 | 50.0 | 63800.0 | 60.0 | 65100.0 | 80.0 | 66700.0 | 60.0 | 66800.0 | 60.0 | 69600.0 | 80.0 | 69700.0 | 80.0 |
| 1 | Alabama | 49900.0 | 55700.0 | 51300.0 | 450.0 | 50400.0 | 360.0 | 51200.0 | 520.0 | 52200.0 | 460.0 | 53200.0 | 520.0 | 53800.0 | 510.0 | 54800.0 | 390.0 | 53900.0 | 470.0 |
| 2 | Alaska | 87600.0 | 84100.0 | 78200.0 | 1530.0 | 80400.0 | 1580.0 | 83900.0 | 1610.0 | 86300.0 | 1530.0 | 80900.0 | 1770.0 | 80200.0 | 1500.0 | 80000.0 | 1740.0 | 77800.0 | 1310.0 |
| 3 | Arizona | 58300.0 | 66200.0 | 61600.0 | 550.0 | 58300.0 | 400.0 | 58900.0 | 300.0 | 60500.0 | 440.0 | 62500.0 | 350.0 | 63900.0 | 480.0 | 65800.0 | 290.0 | 69100.0 | 420.0 |
| 4 | Arkansas | 44700.0 | 52500.0 | 48700.0 | 510.0 | 47700.0 | 490.0 | 48000.0 | 330.0 | 50100.0 | 630.0 | 50700.0 | 510.0 | 50800.0 | 470.0 | 51900.0 | 560.0 | 52500.0 | 610.0 |
| 5 | California | 75700.0 | 77500.0 | 74600.0 | 270.0 | 71900.0 | 270.0 | 73800.0 | 270.0 | 76500.0 | 240.0 | 79400.0 | 200.0 | 81200.0 | 210.0 | 85200.0 | 200.0 | 84900.0 | 330.0 |
| 6 | Colorado | 63800.0 | 77000.0 | 70400.0 | 470.0 | 67300.0 | 550.0 | 73100.0 | 620.0 | 74200.0 | 440.0 | 76400.0 | 520.0 | 77600.0 | 430.0 | 81700.0 | 510.0 | 82300.0 | 480.0 |
| 7 | Connecticut | 88300.0 | 88000.0 | 84800.0 | 690.0 | 79700.0 | 830.0 | 81600.0 | 540.0 | 82900.0 | 730.0 | 82000.0 | 900.0 | 82400.0 | 600.0 | 83500.0 | 870.0 | 83800.0 | 680.0 |
| 8 | Delaware | 73800.0 | 77300.0 | 73000.0 | 1200.0 | 69500.0 | 1130.0 | 70100.0 | 890.0 | 69700.0 | 1020.0 | 69500.0 | 1660.0 | 69900.0 | 1030.0 | 74400.0 | 1050.0 | 71100.0 | 1090.0 |
| 9 | District of Columbia | 65000.0 | 65500.0 | 65700.0 | 1640.0 | 75800.0 | 1170.0 | 86500.0 | 1730.0 | 85200.0 | 2340.0 | 91000.0 | 1780.0 | 91900.0 | 2240.0 | 97800.0 | 1610.0 | 90100.0 | 2110.0 |
| 10 | Florida | 58100.0 | 63300.0 | 59000.0 | 230.0 | 55300.0 | 250.0 | 56500.0 | 230.0 | 57400.0 | 170.0 | 58100.0 | 240.0 | 59800.0 | 250.0 | 62800.0 | 290.0 | 63100.0 | 260.0 |
| 11 | Georgia | 61400.0 | 69200.0 | 63400.0 | 370.0 | 57800.0 | 360.0 | 58600.0 | 260.0 | 60500.0 | 490.0 | 62100.0 | 380.0 | 63400.0 | 470.0 | 65700.0 | 280.0 | 66600.0 | 430.0 |
| 12 | Hawaii | 82100.0 | 81300.0 | 80800.0 | 1660.0 | 78500.0 | 1180.0 | 84000.0 | 1400.0 | 84100.0 | 1220.0 | 86000.0 | 1210.0 | 86500.0 | 1320.0 | 88100.0 | 1380.0 | 84900.0 | 1280.0 |
| 13 | Idaho | 53400.0 | 61300.0 | 57600.0 | 710.0 | 54200.0 | 760.0 | 55200.0 | 760.0 | 58500.0 | 660.0 | 57700.0 | 590.0 | 60000.0 | 600.0 | 64600.0 | 690.0 | 66500.0 | 1070.0 |
| 14 | Illinois | 68200.0 | 76000.0 | 69900.0 | 290.0 | 66000.0 | 330.0 | 68100.0 | 360.0 | 68800.0 | 270.0 | 69600.0 | 370.0 | 70200.0 | 330.0 | 73300.0 | 380.0 | 72200.0 | 350.0 |
| 15 | Indiana | 60900.0 | 67800.0 | 61200.0 | 430.0 | 55600.0 | 340.0 | 57800.0 | 270.0 | 59100.0 | 250.0 | 59900.0 | 440.0 | 60100.0 | 340.0 | 61000.0 | 440.0 | 62700.0 | 440.0 |
| 16 | Iowa | 55500.0 | 64400.0 | 60600.0 | 440.0 | 59700.0 | 490.0 | 62600.0 | 550.0 | 63500.0 | 480.0 | 64700.0 | 570.0 | 64700.0 | 580.0 | 65400.0 | 480.0 | 65600.0 | 470.0 |
| 17 | Kansas | 57700.0 | 66300.0 | 59700.0 | 620.0 | 60100.0 | 660.0 | 61600.0 | 610.0 | 62000.0 | 610.0 | 62400.0 | 470.0 | 62800.0 | 510.0 | 65800.0 | 450.0 | 64100.0 | 580.0 |
| 18 | Kentucky | 47700.0 | 54900.0 | 52000.0 | 400.0 | 49900.0 | 380.0 | 51700.0 | 330.0 | 52700.0 | 410.0 | 53500.0 | 420.0 | 54200.0 | 370.0 | 55400.0 | 400.0 | 55600.0 | 430.0 |
| 19 | Louisiana | 46400.0 | 53100.0 | 51100.0 | 490.0 | 52900.0 | 540.0 | 52300.0 | 470.0 | 51000.0 | 530.0 | 51000.0 | 430.0 | 51700.0 | 490.0 | 54100.0 | 380.0 | 52100.0 | 460.0 |
| 20 | Maine | 58900.0 | 60700.0 | 59500.0 | 820.0 | 57100.0 | 720.0 | 58900.0 | 590.0 | 59900.0 | 950.0 | 62200.0 | 780.0 | 60000.0 | 870.0 | 62400.0 | 1130.0 | 64800.0 | 760.0 |
| 21 | Maryland | 83300.0 | 86200.0 | 85700.0 | 500.0 | 85700.0 | 700.0 | 86700.0 | 410.0 | 89100.0 | 510.0 | 89300.0 | 480.0 | 89800.0 | 670.0 | 91900.0 | 600.0 | 90200.0 | 650.0 |
| 22 | Massachusetts | 78200.0 | 82400.0 | 79500.0 | 590.0 | 77300.0 | 310.0 | 80800.0 | 430.0 | 85000.0 | 530.0 | 85500.0 | 610.0 | 86100.0 | 640.0 | 91000.0 | 640.0 | 89600.0 | 780.0 |
| 23 | Michigan | 65600.0 | 72900.0 | 64000.0 | 380.0 | 56600.0 | 230.0 | 58400.0 | 170.0 | 59300.0 | 280.0 | 60700.0 | 300.0 | 61200.0 | 270.0 | 63100.0 | 370.0 | 63500.0 | 320.0 |
| 24 | Minnesota | 65400.0 | 76800.0 | 72400.0 | 310.0 | 69100.0 | 350.0 | 72600.0 | 470.0 | 74100.0 | 420.0 | 75600.0 | 440.0 | 75900.0 | 350.0 | 79100.0 | 530.0 | 77700.0 | 560.0 |
| 25 | Mississippi | 42600.0 | 51100.0 | 45800.0 | 520.0 | 45900.0 | 480.0 | 46400.0 | 380.0 | 47100.0 | 380.0 | 48100.0 | 570.0 | 48200.0 | 520.0 | 48500.0 | 760.0 | 48700.0 | 760.0 |
| 26 | Missouri | 55800.0 | 61900.0 | 58400.0 | 300.0 | 55200.0 | 380.0 | 57500.0 | 290.0 | 58400.0 | 260.0 | 59300.0 | 420.0 | 58800.0 | 490.0 | 60800.0 | 410.0 | 61800.0 | 410.0 |
| 27 | Montana | 48600.0 | 53900.0 | 54700.0 | 820.0 | 53100.0 | 860.0 | 56600.0 | 980.0 | 56500.0 | 750.0 | 59000.0 | 960.0 | 59700.0 | 770.0 | 60600.0 | 780.0 | 63200.0 | 870.0 |
| 28 | Nebraska | 55000.0 | 64000.0 | 61000.0 | 640.0 | 60300.0 | 680.0 | 62900.0 | 580.0 | 64300.0 | 530.0 | 66300.0 | 680.0 | 64300.0 | 700.0 | 67000.0 | 600.0 | 66800.0 | 640.0 |
| 29 | Nevada | 65600.0 | 72700.0 | 68400.0 | 750.0 | 63500.0 | 600.0 | 60000.0 | 680.0 | 62300.0 | 620.0 | 64100.0 | 640.0 | 63300.0 | 740.0 | 67100.0 | 660.0 | 66300.0 | 610.0 |
| 30 | New Hampshire | 76900.0 | 80700.0 | 78900.0 | 840.0 | 76000.0 | 870.0 | 80400.0 | 870.0 | 80100.0 | 980.0 | 81100.0 | 1140.0 | 80900.0 | 1020.0 | 82600.0 | 1280.0 | 88500.0 | 1300.0 |
| 31 | New Jersey | 86600.0 | 90000.0 | 85800.0 | 440.0 | 84300.0 | 580.0 | 82600.0 | 420.0 | 85900.0 | 480.0 | 88500.0 | 450.0 | 88200.0 | 450.0 | 90900.0 | 490.0 | 89300.0 | 660.0 |
| 32 | New Mexico | 51000.0 | 55700.0 | 52100.0 | 630.0 | 52400.0 | 560.0 | 51900.0 | 600.0 | 52800.0 | 570.0 | 51700.0 | 800.0 | 50900.0 | 590.0 | 55000.0 | 620.0 | 54000.0 | 900.0 |
| 33 | New York | 69700.0 | 70800.0 | 68800.0 | 360.0 | 67400.0 | 280.0 | 69600.0 | 210.0 | 71000.0 | 430.0 | 71700.0 | 340.0 | 73200.0 | 420.0 | 76400.0 | 290.0 | 74300.0 | 370.0 |
| 34 | North Carolina | 56400.0 | 63900.0 | 56600.0 | 270.0 | 54000.0 | 280.0 | 54700.0 | 360.0 | 57100.0 | 200.0 | 58300.0 | 340.0 | 58100.0 | 380.0 | 60800.0 | 390.0 | 62000.0 | 330.0 |
| 35 | North Dakota | 49100.0 | 56400.0 | 57100.0 | 600.0 | 60600.0 | 1190.0 | 69300.0 | 1150.0 | 68500.0 | 1050.0 | 68400.0 | 710.0 | 68900.0 | 1520.0 | 68400.0 | 1940.0 | 66500.0 | 1110.0 |
| 36 | Ohio | 60700.0 | 66800.0 | 60500.0 | 290.0 | 56200.0 | 220.0 | 58400.0 | 160.0 | 59100.0 | 190.0 | 59700.0 | 280.0 | 60500.0 | 280.0 | 62100.0 | 360.0 | 62300.0 | 280.0 |
| 37 | Oklahoma | 49900.0 | 54500.0 | 51500.0 | 480.0 | 52400.0 | 310.0 | 55500.0 | 360.0 | 55500.0 | 430.0 | 55300.0 | 390.0 | 56000.0 | 300.0 | 57700.0 | 390.0 | 55800.0 | 380.0 |
| 38 | Oregon | 57600.0 | 66700.0 | 59700.0 | 490.0 | 58000.0 | 410.0 | 61900.0 | 630.0 | 65000.0 | 590.0 | 66600.0 | 530.0 | 68400.0 | 690.0 | 71100.0 | 560.0 | 71600.0 | 640.0 |
| 39 | Pennsylvania | 61500.0 | 65400.0 | 61900.0 | 330.0 | 61400.0 | 330.0 | 63700.0 | 260.0 | 64200.0 | 250.0 | 65400.0 | 300.0 | 65700.0 | 210.0 | 67300.0 | 280.0 | 69000.0 | 310.0 |
| 40 | Rhode Island | 68100.0 | 68700.0 | 71600.0 | 1160.0 | 65100.0 | 940.0 | 66400.0 | 1680.0 | 68400.0 | 1090.0 | 70600.0 | 1250.0 | 69400.0 | 1350.0 | 75400.0 | 1000.0 | 74000.0 | 1450.0 |
| 41 | South Carolina | 55500.0 | 60500.0 | 54700.0 | 520.0 | 52300.0 | 320.0 | 54000.0 | 380.0 | 55900.0 | 410.0 | 55900.0 | 360.0 | 56400.0 | 410.0 | 59600.0 | 540.0 | 59300.0 | 520.0 |
| 42 | South Dakota | 47600.0 | 57600.0 | 56100.0 | 750.0 | 57200.0 | 920.0 | 60600.0 | 760.0 | 61500.0 | 880.0 | 62500.0 | 880.0 | 60700.0 | 950.0 | 63100.0 | 1260.0 | 66100.0 | 1370.0 |
| 43 | Tennessee | 52500.0 | 59300.0 | 54100.0 | 410.0 | 51600.0 | 310.0 | 54100.0 | 340.0 | 54800.0 | 460.0 | 56700.0 | 280.0 | 56500.0 | 320.0 | 59400.0 | 420.0 | 59700.0 | 480.0 |
| 44 | Texas | 57200.0 | 65100.0 | 58600.0 | 210.0 | 60500.0 | 260.0 | 63600.0 | 220.0 | 63900.0 | 210.0 | 65400.0 | 330.0 | 65400.0 | 180.0 | 67900.0 | 320.0 | 67000.0 | 310.0 |
| 45 | Utah | 62300.0 | 74600.0 | 66700.0 | 800.0 | 68200.0 | 470.0 | 71900.0 | 850.0 | 74500.0 | 660.0 | 75600.0 | 830.0 | 77100.0 | 490.0 | 80300.0 | 700.0 | 79400.0 | 760.0 |
| 46 | Vermont | 63000.0 | 66600.0 | 63500.0 | 1010.0 | 61500.0 | 1120.0 | 65200.0 | 920.0 | 65100.0 | 1150.0 | 63600.0 | 1400.0 | 65600.0 | 1020.0 | 66800.0 | 1130.0 | 72400.0 | 1350.0 |
| 47 | Virginia | 70500.0 | 76100.0 | 75400.0 | 460.0 | 75600.0 | 350.0 | 75800.0 | 380.0 | 76900.0 | 510.0 | 79100.0 | 370.0 | 78300.0 | 450.0 | 81000.0 | 540.0 | 81000.0 | 440.0 |
| 48 | Washington | 66000.0 | 74700.0 | 68500.0 | 540.0 | 69300.0 | 420.0 | 73300.0 | 550.0 | 75800.0 | 410.0 | 78500.0 | 390.0 | 79900.0 | 460.0 | 83400.0 | 590.0 | 84200.0 | 510.0 |
| 49 | West Virginia | 44000.0 | 48400.0 | 46500.0 | 680.0 | 47600.0 | 680.0 | 48100.0 | 550.0 | 49000.0 | 760.0 | 48000.0 | 940.0 | 47600.0 | 670.0 | 51800.0 | 730.0 | 51200.0 | 660.0 |
| 50 | Wisconsin | 62300.0 | 71400.0 | 65500.0 | 330.0 | 61000.0 | 380.0 | 63600.0 | 300.0 | 64100.0 | 380.0 | 65500.0 | 390.0 | 65600.0 | 260.0 | 68000.0 | 390.0 | 67100.0 | 370.0 |
| 51 | Wyoming | 57300.0 | 61800.0 | 64300.0 | 1280.0 | 66600.0 | 1430.0 | 68900.0 | 1100.0 | 67600.0 | 1520.0 | 66800.0 | 1180.0 | 66400.0 | 920.0 | 68900.0 | 1360.0 | 65200.0 | 1670.0 |
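Because the years are encoded in the column names, reshaping this wide table into a long (state, year, income) format simplifies time-series comparisons. A sketch with pandas melt, using a hypothetical two-state, two-year slice of the renamed table:

```python
import pandas as pd

# Hypothetical slice of the renamed income table
wide = pd.DataFrame({
    "state_name": ["Alabama", "Alaska"],
    "median_income_2019": [54800.0, 80000.0],
    "median_income_2021": [53900.0, 77800.0],
})

# Unpivot: one row per (state, year) pair
long = wide.melt(id_vars="state_name",
                 var_name="metric", value_name="median_income")
# Pull the 4-digit year out of the former column name
long["year"] = long["metric"].str.extract(r"(\d{4})", expand=False).astype(int)
print(long[["state_name", "year", "median_income"]])
```

The same melt applies to the std_error columns, giving a tidy table that plots cleanly in Jupyter or Kibana.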
cassandra_host = "cassandra"
# Create a SparkSession with Cassandra configuration
spark = SparkSession.builder \
.appName("WriteToCassandra") \
.config("spark.cassandra.connection.host", cassandra_host) \
.config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.1.0") \
.config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions") \
.getOrCreate()
nces_df = spark.createDataFrame(nces_pd)
nces_df.printSchema()
root |-- state_name: string (nullable = true) |-- median_income_1990: double (nullable = true) |-- median_income_2000: double (nullable = true) |-- median_income_2005: double (nullable = true) |-- std_error_2005: double (nullable = true) |-- median_income_2010: double (nullable = true) |-- std_error_2010: double (nullable = true) |-- median_income_2015: double (nullable = true) |-- std_error_2015: double (nullable = true) |-- median_income_2016: double (nullable = true) |-- std_error_2016: double (nullable = true) |-- median_income_2017: double (nullable = true) |-- std_error_2017: double (nullable = true) |-- median_income_2018: double (nullable = true) |-- std_error_2018: double (nullable = true) |-- median_income_2019: double (nullable = true) |-- std_error_2019: double (nullable = true) |-- median_income_2021: double (nullable = true) |-- std_error_2021: double (nullable = true)
Storing data into Cassandra¶
Automate Keyspace and Table Creation¶
This code demonstrates how to create a keyspace and table in Cassandra programmatically using the Python Cassandra driver, eliminating the need to create them manually via Cassandra's cqlsh command-line interface.
from cassandra.cluster import Cluster
# Connect to Cassandra
cluster = Cluster(["cassandra"]) # Replace "cassandra" with your host if different
session = cluster.connect()
# Create Keyspace
session.execute("""
CREATE KEYSPACE IF NOT EXISTS nces
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
""")
# Create Table
session.set_keyspace("nces")
session.execute("""
CREATE TABLE IF NOT EXISTS nces_income (
state_name TEXT PRIMARY KEY,
median_income_1990 DOUBLE,
median_income_2000 DOUBLE,
median_income_2005 DOUBLE,
std_error_2005 DOUBLE,
median_income_2010 DOUBLE,
std_error_2010 DOUBLE,
median_income_2015 DOUBLE,
std_error_2015 DOUBLE,
median_income_2016 DOUBLE,
std_error_2016 DOUBLE,
median_income_2017 DOUBLE,
std_error_2017 DOUBLE,
median_income_2018 DOUBLE,
std_error_2018 DOUBLE,
median_income_2019 DOUBLE,
std_error_2019 DOUBLE,
median_income_2020 DOUBLE,
std_error_2020 DOUBLE,
median_income_2021 DOUBLE,
std_error_2021 DOUBLE
);
""")
print("Keyspace and Table created successfully!")
# Close the connection
cluster.shutdown()
Keyspace and Table created successfully!
# Write DataFrame to Cassandra
nces_df.write \
.format("org.apache.spark.sql.cassandra") \
.mode("append") \
.options(table="nces_income", keyspace="nces") \
.save()
# Read back from Cassandra
df3 = spark.read.format("org.apache.spark.sql.cassandra") \
.options(table="nces_income", keyspace="nces") \
.load()
df3.printSchema()
root |-- state_name: string (nullable = false) |-- median_income_1990: double (nullable = true) |-- median_income_2000: double (nullable = true) |-- median_income_2005: double (nullable = true) |-- median_income_2010: double (nullable = true) |-- median_income_2015: double (nullable = true) |-- median_income_2016: double (nullable = true) |-- median_income_2017: double (nullable = true) |-- median_income_2018: double (nullable = true) |-- median_income_2019: double (nullable = true) |-- median_income_2020: double (nullable = true) |-- median_income_2021: double (nullable = true) |-- std_error_2005: double (nullable = true) |-- std_error_2010: double (nullable = true) |-- std_error_2015: double (nullable = true) |-- std_error_2016: double (nullable = true) |-- std_error_2017: double (nullable = true) |-- std_error_2018: double (nullable = true) |-- std_error_2019: double (nullable = true) |-- std_error_2020: double (nullable = true) |-- std_error_2021: double (nullable = true)
Save a local JSON copy.¶
df3.write.format("json").mode("overwrite").save("/home/jovyan/datasets/nces.json")
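Note that Spark's JSON writer produces a directory of newline-delimited part files rather than a single .json document: each line is one record. A sketch of reading that format back with pandas, using an in-memory example instead of the actual /home/jovyan/datasets/nces.json output:

```python
import io
import pandas as pd

# Two JSON-lines records, the shape Spark's json writer emits per part file
part_file = io.StringIO(
    '{"state_name": "Alabama", "median_income_2021": 53900.0}\n'
    '{"state_name": "Alaska", "median_income_2021": 77800.0}\n'
)

# lines=True tells pandas to parse one JSON object per line
df = pd.read_json(part_file, lines=True)
print(df)
```

Reading the saved dataset back with spark.read.json on the same path handles the part files transparently.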
Go to Part III to see all 3 datasets saved to Elasticsearch and an example of the 2 dashboards generated.¶
from pyspark.sql import SparkSession
# Initialize Spark Session with Cassandra connector
spark = SparkSession.builder \
.appName("JoinChildcareAndNcesKeyspaces") \
.config("spark.cassandra.connection.host", "cassandra") \
.config("spark.cassandra.connection.port", "9042") \
.config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.12:3.1.0") \
.getOrCreate()
# Manually list tables for childcare and nces keyspaces
childcare_tables = ["childcare_prices"]
nces_tables = ["nces_income"]
# Helper function to load a table from Cassandra as a Spark DataFrame
def load_table_as_dataframe(keyspace, table):
return spark.read \
.format("org.apache.spark.sql.cassandra") \
.options(keyspace=keyspace, table=table) \
.load()
# Load and filter tables with "state_name" from childcare keyspace
childcare_dfs = {}
for table in childcare_tables:
df = load_table_as_dataframe("childcare", table)
if "state_name" in df.columns:
childcare_dfs[table] = df
# Load and filter tables with "state_name" from nces keyspace
nces_dfs = {}
for table in nces_tables:
df = load_table_as_dataframe("nces", table)
if "state_name" in df.columns:
nces_dfs[table] = df
# Start joining tables by "state_name"
joined_df = None
# Join all DataFrames from childcare keyspace
for table, df in childcare_dfs.items():
if joined_df is None:
joined_df = df
else:
joined_df = joined_df.join(df, on="state_name", how="inner")
# Join with DataFrames from nces keyspace
for table, df in nces_dfs.items():
joined_df = joined_df.join(df, on="state_name", how="inner")
# Convert the final joined DataFrame to Pandas
if joined_df is not None:
joined_pandas_df = joined_df.toPandas()
# Stop Spark session
spark.stop()
joined_pandas_df[joined_pandas_df['state_name'] == 'New Hampshire']
| state_name | county_name | county_fips_code | infant_center_based_price_2018_dollar | infant_center_based_price_2023_estimated_dollar | infant_center_based_price_as_share_of_family_income_2018_percent | infant_home_based_price_2018_dollar | infant_home_based_price_2023_estimated_dollar | infant_home_based_price_as_share_of_family_income_2018_percent | median_family_income_dollar | percent_asian_percent | percent_black_percent | percent_hispanic_of_any_race_percent | percent_of_families_in_poverty_percent | percent_white_percent | preschool_center_based_price_2018_dollar | preschool_center_based_price_2023_estimated_dollar | preschool_center_based_price_as_share_of_family_income_2018_percent | preschool_home_based_price_2018_dollar | preschool_home_based_price_2023_estimated_dollar | preschool_home_based_price_as_share_of_family_income_2018_percent | reference_sheetname | school_age_center_based_price_2018_dollar | school_age_center_based_price_2023_estimated_dollar | school_age_center_based_price_as_share_of_family_income_2018_percent | school_age_home_based_price_2018_dollar | school_age_home_based_price_2023_estimated_dollar | school_age_home_based_price_as_share_of_family_income_2018_percent | toddler_center_based_price_2018_dollar | toddler_center_based_price_2023_estimated_dollar | toddler_center_based_price_as_share_of_family_income_2018_percent | toddler_home_based_price_2018_dollar | toddler_home_based_price_2023_estimated_dollar | toddler_home_based_price_as_share_of_family_income_2018_percent | total_population | womens_labor_force_participation_rate_percent | womens_median_earnings_dollar | median_income_1990 | median_income_2000 | median_income_2005 | median_income_2010 | median_income_2015 | median_income_2016 | median_income_2017 | median_income_2018 | median_income_2019 | median_income_2020 | median_income_2021 | std_error_2005 | std_error_2010 | std_error_2015 | std_error_2016 | std_error_2017 | std_error_2018 | std_error_2019 | std_error_2020 | std_error_2021 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 999 | New Hampshire | Belknap County | 33001 | 10860.72 | 12783.633842 | 13.128862 | 7986.16 | 9400.126810 | 9.653982 | 82724.0 | 0.8 | 0.7 | 1.6 | 5.5 | 96.5 | 8966.88 | 10554.485396 | 10.839514 | 7986.16 | 9400.126810 | 9.653982 | NewHampshire | 8670.48 | 10205.607138 | 10.481215 | 7986.16 | 9400.126810 | 9.653982 | 10039.64 | 11817.179862 | 12.136309 | 7986.16 | 9400.126810 | 9.653982 | 60640 | 77.4 | 30781.0 | 76900.0 | 80700.0 | 78900.0 | 76000.0 | 80400.0 | 80100.0 | 81100.0 | 80900.0 | 82600.0 | NaN | 88500.0 | 840.0 | 870.0 | 870.0 | 980.0 | 1140.0 | 1020.0 | 1280.0 | NaN | 1300.0 |
| 1000 | New Hampshire | Carroll County | 33003 | 10914.80 | 12847.288822 | 15.219052 | 8025.68 | 9446.643911 | 11.190608 | 71718.0 | 0.7 | 0.5 | 1.4 | 4.8 | 97.2 | 9011.60 | 10607.123168 | 12.565325 | 8025.68 | 9446.643911 | 11.190608 | NewHampshire | 8713.64 | 10256.408709 | 12.149865 | 8025.68 | 9446.643911 | 11.190608 | 10089.04 | 11875.326238 | 14.067654 | 8025.68 | 9446.643911 | 11.190608 | 47840 | 74.3 | 26358.0 | 76900.0 | 80700.0 | 78900.0 | 76000.0 | 80400.0 | 80100.0 | 81100.0 | 80900.0 | 82600.0 | NaN | 88500.0 | 840.0 | 870.0 | 870.0 | 980.0 | 1140.0 | 1020.0 | 1280.0 | NaN | 1300.0 |
| 1001 | New Hampshire | Cheshire County | 33005 | 10904.40 | 12835.047480 | 13.635613 | 8017.88 | 9437.462904 | 10.026110 | 79970.0 | 1.4 | 0.8 | 1.8 | 5.3 | 95.3 | 9002.76 | 10596.718027 | 11.257672 | 8017.88 | 9437.462904 | 10.026110 | NewHampshire | 8705.32 | 10246.615635 | 10.885732 | 8017.88 | 9437.462904 | 10.026110 | 10079.68 | 11864.309030 | 12.604327 | 8017.88 | 9437.462904 | 10.026110 | 76263 | 78.4 | 26174.0 | 76900.0 | 80700.0 | 78900.0 | 76000.0 | 80400.0 | 80100.0 | 81100.0 | 80900.0 | 82600.0 | NaN | 88500.0 | 840.0 | 870.0 | 870.0 | 980.0 | 1140.0 | 1020.0 | 1280.0 | NaN | 1300.0 |
| 1002 | New Hampshire | Coos County | 33007 | 8698.04 | 10238.046695 | 14.266800 | 6395.48 | 7527.813493 | 10.490068 | 60967.0 | 0.6 | 2.9 | 2.9 | 8.4 | 93.9 | 7181.20 | 8452.646910 | 11.778831 | 6395.48 | 7527.813493 | 10.490068 | NewHampshire | 6943.56 | 8172.932237 | 11.389046 | 6395.48 | 7527.813493 | 10.490068 | 8040.24 | 9463.781790 | 13.187856 | 6395.48 | 7527.813493 | 10.490068 | 32038 | 78.1 | 25035.0 | 76900.0 | 80700.0 | 78900.0 | 76000.0 | 80400.0 | 80100.0 | 81100.0 | 80900.0 | 82600.0 | NaN | 88500.0 | 840.0 | 870.0 | 870.0 | 980.0 | 1140.0 | 1020.0 | 1280.0 | NaN | 1300.0 |
| 1003 | New Hampshire | Grafton County | 33009 | 11322.48 | 13327.149443 | 14.008983 | 8325.20 | 9799.194571 | 10.300533 | 80823.0 | 3.7 | 1.2 | 2.3 | 6.4 | 92.5 | 9348.04 | 11003.130594 | 11.566064 | 8325.20 | 9799.194571 | 10.300533 | NewHampshire | 9039.16 | 10639.562725 | 11.183896 | 8325.20 | 9799.194571 | 10.300533 | 10466.04 | 12319.074899 | 12.949334 | 8325.20 | 9799.194571 | 10.300533 | 89811 | 76.2 | 26663.0 | 76900.0 | 80700.0 | 78900.0 | 76000.0 | 80400.0 | 80100.0 | 81100.0 | 80900.0 | 82600.0 | NaN | 88500.0 | 840.0 | 870.0 | 870.0 | 980.0 | 1140.0 | 1020.0 | 1280.0 | NaN | 1300.0 |
| 1004 | New Hampshire | Hillsborough County | 33011 | 12736.88 | 14991.972006 | 13.286960 | 9365.20 | 11023.328809 | 9.769664 | 95860.0 | 4.1 | 2.8 | 6.5 | 5.4 | 89.7 | 10515.96 | 12377.833342 | 10.970123 | 9365.20 | 11023.328809 | 9.769664 | NewHampshire | 10168.08 | 11968.360440 | 10.607219 | 9365.20 | 11023.328809 | 9.769664 | 11773.32 | 13857.811635 | 12.281786 | 9365.20 | 11023.328809 | 9.769664 | 411087 | 77.8 | 33250.0 | 76900.0 | 80700.0 | 78900.0 | 76000.0 | 80400.0 | 80100.0 | 81100.0 | 80900.0 | 82600.0 | NaN | 88500.0 | 840.0 | 870.0 | 870.0 | 980.0 | 1140.0 | 1020.0 | 1280.0 | NaN | 1300.0 |
| 1005 | New Hampshire | Merrimack County | 33013 | 11683.88 | 13752.536091 | 12.989450 | 8590.92 | 10111.960869 | 9.550879 | 89949.0 | 2.1 | 1.3 | 2.1 | 4.6 | 94.2 | 9646.52 | 11354.457120 | 10.724433 | 8590.92 | 10111.960869 | 9.550879 | NewHampshire | 9327.24 | 10978.647909 | 10.369476 | 8590.92 | 10111.960869 | 9.550879 | 10799.88 | 12712.021989 | 12.006670 | 8590.92 | 10111.960869 | 9.550879 | 149452 | 78.7 | 31400.0 | 76900.0 | 80700.0 | 78900.0 | 76000.0 | 80400.0 | 80100.0 | 81100.0 | 80900.0 | 82600.0 | NaN | 88500.0 | 840.0 | 870.0 | 870.0 | 980.0 | 1140.0 | 1020.0 | 1280.0 | NaN | 1300.0 |
| 1006 | New Hampshire | Rockingham County | 33015 | 13253.76 | 15600.366722 | 12.290208 | 9745.32 | 11470.749872 | 9.036832 | 107840.0 | 2.0 | 0.8 | 2.9 | 3.0 | 94.8 | 10942.88 | 12880.340447 | 10.147329 | 9745.32 | 11470.749872 | 9.036832 | NewHampshire | 10580.96 | 12454.341732 | 9.811721 | 9745.32 | 11470.749872 | 9.036832 | 12251.72 | 14420.913385 | 11.361016 | 9745.32 | 11470.749872 | 9.036832 | 305129 | 81.4 | 35170.0 | 76900.0 | 80700.0 | 78900.0 | 76000.0 | 80400.0 | 80100.0 | 81100.0 | 80900.0 | 82600.0 | NaN | 88500.0 | 840.0 | 870.0 | 870.0 | 980.0 | 1140.0 | 1020.0 | 1280.0 | NaN | 1300.0 |
| 1007 | New Hampshire | Strafford County | 33017 | 11847.68 | 13945.337233 | 13.487335 | 8711.56 | 10253.960440 | 9.917193 | 87843.0 | 3.6 | 0.9 | 2.3 | 5.4 | 92.8 | 9781.72 | 11513.594571 | 11.135458 | 8711.56 | 10253.960440 | 9.917193 | NewHampshire | 9458.28 | 11132.888823 | 10.767255 | 8711.56 | 10253.960440 | 9.917193 | 10951.72 | 12890.745588 | 12.467379 | 8711.56 | 10253.960440 | 9.917193 | 128237 | 76.5 | 27104.0 | 76900.0 | 80700.0 | 78900.0 | 76000.0 | 80400.0 | 80100.0 | 81100.0 | 80900.0 | 82600.0 | NaN | 88500.0 | 840.0 | 870.0 | 870.0 | 980.0 | 1140.0 | 1020.0 | 1280.0 | NaN | 1300.0 |
| 1008 | New Hampshire | Sullivan County | 33019 | 10393.24 | 12233.385502 | 14.018587 | 7641.92 | 8994.938377 | 10.307557 | 74139.0 | 0.8 | 0.8 | 1.5 | 6.9 | 96.4 | 8581.04 | 10100.331593 | 11.574259 | 7641.92 | 8994.938377 | 10.307557 | NewHampshire | 8297.12 | 9766.142947 | 11.191303 | 7641.92 | 8994.938377 | 10.307557 | 9607.00 | 11307.940019 | 12.958092 | 7641.92 | 8994.938377 | 10.307557 | 43125 | 76.6 | 32496.0 | 76900.0 | 80700.0 | 78900.0 | 76000.0 | 80400.0 | 80100.0 | 81100.0 | 80900.0 | 82600.0 | NaN | 88500.0 | 840.0 | 870.0 | 870.0 | 980.0 | 1140.0 | 1020.0 | 1280.0 | NaN | 1300.0 |
import matplotlib.pyplot as plt
import numpy as np
# Filter the DataFrame for rows where state_name is 'New Hampshire'
new_hampshire_data = joined_pandas_df[joined_pandas_df['state_name'] == 'New Hampshire']
# Extract years and median incomes
years = np.array([1990, 2000, 2010, 2015, 2020, 2021])
median_incomes = np.array([
new_hampshire_data['median_income_1990'].values[0],
new_hampshire_data['median_income_2000'].values[0],
new_hampshire_data['median_income_2010'].values[0],
new_hampshire_data['median_income_2015'].values[0],
new_hampshire_data['median_income_2020'].values[0],
new_hampshire_data['median_income_2021'].values[0]
])
# Interpolation
interpolation_years = np.arange(2015, 2021) # Generate years 2015 through 2020
interpolated_incomes = np.interp(interpolation_years, years, median_incomes) # Interpolate between 2015 and 2020
# Plot the trends
plt.figure(figsize=(10, 6))
plt.plot(years, median_incomes, marker='o', label='Known Median Income', linestyle='-', color='blue')
plt.plot(interpolation_years, interpolated_incomes, linestyle='--', label='Interpolated Trend (2015-2020)', color='orange')
plt.title('Median Income Trends in New Hampshire (With Interpolation)', fontsize=16)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Median Income ($)', fontsize=12)
plt.grid(True)
plt.legend()
plt.show()
Observations¶
1990 to 2000:
- There is a steady increase in median income, indicating economic growth and improved earning capacity in New Hampshire during this decade.
2000 to 2010:
- The median income shows a significant drop, which may align with broader economic trends like the Great Recession (2008). This suggests that households in New Hampshire were impacted by national economic challenges.
2010 to 2015:
- The median income continues to decline, albeit at a slower pace than in the previous decade. This could indicate a delayed recovery from the recession or other economic factors affecting the state.
2015 to 2020:
- There is a sharp recovery in median income, with income levels rebounding strongly. This might reflect an improving economy, increased job opportunities, and better wages during this period.
2020 to 2021:
- The income stabilizes or slightly dips, possibly reflecting the economic impact of the COVID-19 pandemic. This suggests some level of stagnation or challenges in maintaining economic growth during this period.
Observed Growth Pattern (2015 to 2020): The interpolated median income values between 2015 and 2020 indicate a steady increase in income levels during this period. The median income starts at approximately $82,000 in 2015 and rises to around $88,000 by 2020, reflecting roughly 7% growth over five years.
Annual Growth Rate: Using the interpolated data, the estimated annual growth rate for median income in New Hampshire from 2015 to 2020 is approximately 1.2% per year.
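The annual growth rate above can be reproduced directly from the series endpoints. A minimal sketch, using the rounded endpoint values for illustration (the exact figure depends on which endpoints are used, so the result lands in the 1-1.5% range rather than exactly 1.2%):

```python
# Sketch of the annual-growth calculation; the endpoint values below are
# illustrative round numbers, not the exact dataset figures.
income_start = 82_000.0  # approximate 2015 median income
income_end = 88_000.0    # approximate 2020 median income
n_years = 5

# Compound annual growth rate (CAGR) between the two endpoints.
cagr = (income_end / income_start) ** (1 / n_years) - 1
print(f"Estimated annual growth rate: {cagr:.1%}")
```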
import matplotlib.pyplot as plt
# Extract relevant columns for New Hampshire
new_hampshire_data = joined_pandas_df[joined_pandas_df['state_name'] == 'New Hampshire']
# Extract data for costs and median family income
infant_center_cost = new_hampshire_data['infant_center_based_price_2018_dollar'].values[0]
infant_home_cost = new_hampshire_data['infant_home_based_price_2018_dollar'].values[0]
median_family_income = new_hampshire_data['median_family_income_dollar'].values[0]
# Calculate childcare costs as a percentage of median family income
center_cost_share = (infant_center_cost / median_family_income) * 100
home_cost_share = (infant_home_cost / median_family_income) * 100
# Plot comparison
plt.figure(figsize=(8, 5))
plt.bar(['Center-Based Care', 'Home-Based Care'], [center_cost_share, home_cost_share], color=['blue', 'green'])
# Add labels and titles
plt.title('Infant Childcare Costs as a Percentage of Median Family Income in New Hampshire (2018)', fontsize=14)
plt.ylabel('Percentage of Median Income (%)', fontsize=12)
import builtins  # use the true built-in max in case it was shadowed (e.g. by pyspark.sql.functions.max)
plt.ylim(0, builtins.max(center_cost_share, home_cost_share) + 5)  # leave headroom above the taller bar
plt.grid(axis='y', linestyle='--', alpha=0.7)
# Annotate values on bars
plt.text(0, center_cost_share + 1, f'{center_cost_share:.1f}%', ha='center', fontsize=12)
plt.text(1, home_cost_share + 1, f'{home_cost_share:.1f}%', ha='center', fontsize=12)
plt.show()
Observations¶
The bar graph illustrates Infant Childcare Costs as a Percentage of Median Family Income in New Hampshire for the year 2018, broken down into:
- Center-Based Care: Costs account for 13.1% of the median family income.
- Home-Based Care: Costs account for 9.7% of the median family income.
This comparison reveals that Center-Based Care is significantly more expensive than Home-Based Care, consuming a larger share of household income.
Income Growth vs. Childcare Costs:
Between 2015 and 2020, median income showed consistent growth, indicating improved financial capacity for families. However, childcare costs in 2018 remain substantial, suggesting that income growth has not fully offset the high costs of childcare.
Trend Alignment:
While median incomes are on an upward trajectory, the percentage of income spent on childcare demonstrates a gap between income growth and rising childcare expenses. If childcare costs continue to rise faster than income, this trend may become unsustainable for many families.
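One way to see why this gap matters: if childcare costs grow faster than income, the share of income spent on childcare compounds upward year over year. A hypothetical projection (the 3% cost growth rate is an assumption chosen for illustration, not a measured figure from the dataset):

```python
# Hypothetical projection of the cost-to-income share under assumed growth rates.
income_growth = 0.012  # ~1.2%/yr, from the 2015-2020 estimate above
cost_growth = 0.03     # assumed childcare cost growth rate (illustrative)
share = 13.1           # center-based care share of median income in 2018 (%)

for _ in range(5):  # project five years forward
    share *= (1 + cost_growth) / (1 + income_growth)

print(f"Projected share after 5 years: {share:.1f}%")
```

Under these assumptions the burden rises by more than a percentage point in five years, which is the dynamic the sustainability concern points to.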
Insights¶
Center-Based Care is notably more expensive than Home-Based Care, requiring a larger share of household income. Despite the upward trend in income levels between 2015 and 2020, childcare costs remain a significant financial burden for families in New Hampshire. To address this issue, policymakers and stakeholders should consider strategies to reduce childcare costs or align them more closely with income growth to enhance affordability for families.
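The single-state cost-share calculation above generalizes directly to the full dataset, which is how a cross-state affordability comparison would be built. A sketch, assuming a frame with the same column names as `joined_pandas_df` (the two rows below are illustrative placeholder values, not real data):

```python
import pandas as pd

# Illustrative placeholder rows using the same column names as joined_pandas_df.
df = pd.DataFrame({
    'state_name': ['New Hampshire', 'Vermont'],
    'infant_center_based_price_2018_dollar': [12500.0, 11000.0],
    'infant_home_based_price_2018_dollar': [9300.0, 8500.0],
    'median_family_income_dollar': [95000.0, 82000.0],
})

# Childcare cost burden as a percentage of median family income, per state.
df['center_share_pct'] = df['infant_center_based_price_2018_dollar'] / df['median_family_income_dollar'] * 100
df['home_share_pct'] = df['infant_home_based_price_2018_dollar'] / df['median_family_income_dollar'] * 100

print(df[['state_name', 'center_share_pct', 'home_share_pct']].round(1))
```

Sorting the result by `center_share_pct` would surface the states where center-based care is least affordable relative to local incomes.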